COS960: A Chinese Word Similarity Dataset of 960 Word Pairs

by   Junjie Huang, et al.
Beihang University
Tsinghua University

Word similarity computation is a widely recognized task in the field of lexical semantics. Most proposed tasks test the similarity of word pairs consisting of a single morpheme, while few works focus on words of two or more morphemes. In this work, we propose COS960, a benchmark dataset with 960 pairs of Chinese wOrd Similarity, where all the words have two morphemes, belong to one of three Part of Speech (POS) tags, and are annotated by humans for similarity rather than relatedness. We give a detailed description of the dataset construction and annotation process, and test a range of word embedding models on it. The dataset of this paper can be obtained from






1 Introduction

Word similarity computation is the task of automatically computing similarity scores for given word pairs, and it is the most popular way to evaluate the quality of word embeddings (Faruqui et al., 2016). The task evaluates the correlation between model-computed similarities and human judgement: the higher the correlation, the more semantic information the model captures (Bakarov, 2018).

A large number of diverse datasets have been constructed to evaluate word similarity, most of them in English. Rubenstein and Goodenough (1965) attempt to compute word similarities in order to test the distributional hypothesis (Harris, 1954) and construct the first dataset, RG65, a list of 65 noun pairs with human-annotated similarity scores in the range 0-4. Since then, a series of similarity datasets with distinct characteristics have appeared, including:

(1) focusing on word relatedness: WordSim-353 (Finkelstein et al., 2001), YP-130 (Yang and Powers, 2006), MEN (Bruni et al., 2012), MTurk-287 (Radinsky et al., 2011), MTurk-771 (Halawi et al., 2012);

(2) focusing on true word similarity: SimLex-999 (Hill et al., 2015), SimVerb-3500 (Gerz et al., 2016), Verb-143 (Baker et al., 2014);

(3) in Chinese: WordSim-297 (Jin and Wu, 2012), WordSim-240 (Wang et al., 2011), polysemous word (Guo et al., 2014), PKU-500 (Wu and Li, 2016);

(4) other highlights: two-word phrasal similarity (Mitchell and Lapata, 2010), rare words (Luong et al., 2013), words in sentential context (Huang et al., 2012), cross-lingual word similarity (Camacho-Collados et al., 2017).

In English, there are a number of datasets focusing on true word similarity, which has wide applications in dictionary generation (Cimiano et al., 2005), machine translation (He et al., 2008; Marton et al., 2009) and language correction (Li et al., 2006). However, a dataset focusing on true word similarity has long been absent in Chinese. In addition, most datasets consist of single-word pairs; few of them consider the similarity of Multiword Expressions (MWEs), which are considered a “pain in the neck” (Sag et al., 2002) for natural language processing (NLP).

Therefore, we introduce COS960, a Chinese word similarity dataset of 960 word pairs, where each word is actually a two-word MWE. Each word pair is annotated by 15 native speakers according to its true similarity rather than association. We also report the performance of a variety of word embedding methods on COS960. We hope COS960 will be helpful to the NLP community.

2 Dataset Construction

2.1 Data Preparation

Word Selection

To make sure that our two-morpheme words are genuinely existing Chinese words, we use the well-known linguistic knowledge base HowNet as the source of words. We extract every word for which both of its morphemes and the word itself appear in HowNet, forming a dataset of 51,034 such triples.

Then we split the dataset into four parts based on the POS tags of the words: noun, verb, adjective and other. We use the POS tags annotated in HowNet and filter out words that have more than one POS tag or no POS tag. The final number of each set is correspondingly. Here we only use the noun, verb and adjective sets.
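The selection rule described above can be sketched as follows. This is only a minimal illustration, not the authors' code: the lexicon is a hypothetical toy dictionary standing in for HowNet, the names `TOY_LEXICON` and `select_two_morpheme_words` are invented, and each morpheme is assumed to be a single character.

```python
# Hypothetical stand-in for HowNet: entry -> set of POS tags (toy data, not real HowNet output).
TOY_LEXICON = {
    "电": {"noun"}, "脑": {"noun"}, "电脑": {"noun"},
    "学": {"verb"}, "习": {"verb"}, "学习": {"verb"},
    "美": {"adj"}, "美丽": {"adj"},            # "丽" is missing, so "美丽" is dropped
    "工": {"noun"}, "作": {"verb"}, "工作": {"noun", "verb"},  # two POS tags, dropped
}

KEPT_TAGS = {"noun", "verb", "adj"}


def select_two_morpheme_words(lexicon):
    """Group two-morpheme words by POS tag.

    A word is kept only if (a) it has exactly two characters, assumed here to be
    its two morphemes, (b) both morphemes are themselves entries of the lexicon,
    and (c) the word carries exactly one POS tag, which is noun, verb or adj.
    """
    selected = {}
    for word, tags in lexicon.items():
        if len(word) != 2:
            continue
        if word[0] not in lexicon or word[1] not in lexicon:
            continue
        if len(tags) != 1:
            continue
        (tag,) = tags
        if tag in KEPT_TAGS:
            selected.setdefault(tag, []).append(word)
    return selected


selected = select_two_morpheme_words(TOY_LEXICON)
```

On the toy lexicon, only "电脑" (noun) and "学习" (verb) survive the three checks.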

Word Pair Generation

We pair up the words within each of the three above-mentioned sets. Then we calculate the cosine similarity of each pair based on word embeddings learned by GloVe (Pennington et al., 2014) on the Sogou-T corpus (a corpus of web pages containing 2.7 billion words), with a word vector dimension of 200.

We further divide the word pairs of each of the three POS tags into five parts according to their similarity range. Note that we do not take word pairs with cosine similarity lower than 0.4 into account, because almost all of them are not really similar to each other. The number of word pairs in each set is shown in Table 1. Finally, we obtain 480 noun pairs, 240 verb pairs and 240 adjective pairs.

        noun-noun   verb-verb   adjective-adjective
        96          48          48
        96          48          48
        96          48          48
        96          48          48
        96          48          48
total   480         240         240
Table 1: Number of MWE pairs with different cosine similarities in three sets.
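As a rough illustration of the pair generation step, the sketch below computes cosine similarity for every pair inside one word set and discards pairs below the 0.4 cutoff mentioned in the text. The function names and the two-dimensional toy vectors are assumptions made for this example; the paper uses 200-dimensional GloVe vectors, and the boundaries of the five similarity ranges are not reproduced here.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_pairs(embeddings, min_similarity=0.4):
    """Return (word_a, word_b, similarity) for all pairs at or above the cutoff."""
    pairs = []
    for word_a, word_b in combinations(sorted(embeddings), 2):
        sim = cosine_similarity(embeddings[word_a], embeddings[word_b])
        if sim >= min_similarity:
            pairs.append((word_a, word_b, sim))
    return pairs


# Toy vectors (hypothetical, 2-dimensional for readability).
toy_embeddings = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
kept = candidate_pairs(toy_embeddings)
```

Here the pair ("a", "c") has similarity 0 and is filtered out, while ("a", "b") and ("b", "c") are kept.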

2.2 Annotation Details

The 960 pairs are randomly shuffled and divided into two parts of 480 pairs each. We recruit 30 university students who are native speakers, and each of them is asked to annotate 480 pairs of words, so that every pair receives 15 ratings. Annotators are shown the definition of each word and its categories in TongYiCiCiLin as a reference, and are asked to give each word pair a similarity score in the range 0-4.

Before formal annotation, annotators are asked to read the Annotation Guidebook, which explains the difference between similarity and relatedness. To improve annotation quality, they are required to take an exam before annotating COS960, which consists of at least two word pairs for each POS tag and similarity level (35 in total).

During annotation, annotators are welcome to discuss and raise questions whenever they hesitate, which helps to improve both the consistency between annotators and the overall annotation quality.

2.3 Post-processing

We calculate Krippendorff’s alpha between every two annotators, and all of their annotations are accepted. Finally, we use the average score of each pair as its final similarity score, forming COS960.
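The aggregation step can be sketched in a few lines. The function names and the toy ratings below are hypothetical; the only facts taken from the text are the 0-4 scale, the averaging, and the counts (30 annotators, 480 pairs each, 960 pairs overall). Krippendorff’s alpha itself is not computed here.

```python
from statistics import mean


def ratings_per_pair(num_annotators, pairs_per_annotator, total_pairs):
    """Number of ratings each pair receives when the workload is spread evenly."""
    return num_annotators * pairs_per_annotator // total_pairs


def final_scores(ratings):
    """Map each word pair to the mean of its annotators' 0-4 ratings."""
    return {pair: mean(scores) for pair, scores in ratings.items()}


# Toy ratings for two hypothetical pairs (four annotators each, for brevity).
toy_ratings = {
    ("word_a", "word_b"): [4, 4, 3, 3],
    ("word_c", "word_d"): [0, 1, 0, 1],
}
scores = final_scores(toy_ratings)
per_pair = ratings_per_pair(30, 480, 960)
```

With the counts given in Section 2.2, `per_pair` comes out as 15, matching the 15 native speakers per pair stated in the introduction.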

3 Experiment

In this section, we provide experimental results of several existing word embedding models on our COS960 dataset.

3.1 Experimental Settings

To evaluate our COS960, we choose some typical word embedding models to test, including: (1) Skip-Gram (Mikolov et al., 2013); (2) continuous-bag-of-words (CBOW) (Mikolov et al., 2013); (3) GloVe (Pennington et al., 2014); (4) CWE, a character-enhanced word embedding (Chen et al., 2015); (5) fasttext, enriched word vectors with subword information (Bojanowski et al., 2016); (6) cw2vec, a Chinese embedding with stroke n-gram information (Cao et al., 2018). For hyper-parameters, we set the training epochs of every model to 5 and keep the other default parameters of each model.

For the evaluation protocol, we calculate three measures between the cosine similarities of word pairs computed from the models’ word embeddings and the human-annotated scores: the Pearson correlation coefficient, Spearman’s rank correlation coefficient, and the square root of the product of the two (Square-Mul).
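The three measures can be sketched in plain Python as below. This is an illustrative implementation, not the authors' evaluation script; in particular, Square-Mul is read here as the square root of the product of the Pearson and Spearman coefficients, an interpretation that is consistent with the values reported in Table 2. All function names are chosen for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tie_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tie_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def square_mul(x, y):
    """Square root of Pearson times Spearman (assumes the product is non-negative)."""
    return math.sqrt(pearson(x, y) * spearman(x, y))


# Toy example: model similarities vs. human scores for four hypothetical pairs.
model_sims = [1.0, 2.0, 3.0, 4.0]
human_scores = [1.0, 4.0, 9.0, 16.0]
rho = spearman(model_sims, human_scores)
r = pearson(model_sims, human_scores)
```

In the toy example the ranking is identical, so Spearman is 1 while Pearson is slightly lower because the relationship is not linear.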

3.2 Experimental Results

Overall Results

The overall evaluation results on COS960 are shown in Table 2. From the table, we observe that:

            Spearman   Pearson   Square-Mul
Skip-Gram   76.2       71.0      73.6
CBOW        78.2       72.1      75.1
GloVe       75.0       72.0      73.5
CWE         72.1       65.9      69.0
cw2vec      75.4       68.1      71.7
fasttext    75.5       70.0      72.7
Table 2: Correlation coefficients (×100) between similarity scores assigned by the word embedding models and human ratings on all 960 pairs of words.

(1) CBOW achieves the best performance, outperforming the second-best model by 2.1% on average.

(2) All six methods achieve considerably high correlation scores under all three evaluation measures. This indicates that the cosine similarity of the six evaluated word embeddings still correlates well with true word similarity, which contradicts Hill et al. (2015).

(3) All six methods achieve their highest score under Spearman’s rank correlation. We attribute this to high annotation consistency, as there is often more than one word pair at each similarity level.

Effect of POS tags

            Spearman   Pearson   Square-Mul
Skip-Gram   74.5       66.8      70.5
CBOW        77.0       69.7      73.2
GloVe       73.7       68.6      71.1
CWE         74.2       64.2      69.0
cw2vec      73.7       64.8      69.1
fasttext    74.9       66.4      70.5
Table 3: Correlation coefficients (×100) between similarity scores assigned by the word embedding models and human ratings on all 480 pairs of nouns.

We further present the performance of the models on COS960 for each of the three POS tags, i.e. nouns in Table 3, verbs in Table 4 and adjectives in Table 5.

            Spearman   Pearson   Square-Mul
Skip-Gram   83.2       81.1      82.1
CBOW        84.8       80.7      82.7
GloVe       78.5       78.1      78.3
CWE         78.1       76.6      77.3
cw2vec      82.5       78.1      80.3
fasttext    82.9       80.5      81.7
Table 4: Correlation coefficients (×100) between similarity scores assigned by the word embedding models and human ratings on all 240 pairs of verbs.

            Spearman   Pearson   Square-Mul
Skip-Gram   80.0       77.0      78.5
CBOW        78.5       74.4      76.4
GloVe       77.7       76.8      77.1
CWE         71.6       67.9      69.8
cw2vec      77.0       70.5      73.7
fasttext    78.8       76.1      77.5
Table 5: Correlation coefficients (×100) between similarity scores assigned by the word embedding models and human ratings on all 240 pairs of adjectives.

From Tables 3, 4 and 5, we find that:

(1) CBOW still performs best on nouns and verbs, which is consistent with the overall results.

(2) On average, the models perform best on verb pairs and worst on noun pairs.

4 Conclusion

In this paper we propose COS960, a Chinese word similarity dataset of 960 word pairs, where all selected words are MWEs with two component words. We also describe the process of the dataset construction in detail and perform evaluation on existing word embedding models. We hope this dataset will contribute to the development of distributional semantics in Chinese.


References

  • Bakarov (2018) Amir Bakarov. 2018. A Survey of Word Embeddings Evaluation Methods. arXiv preprint arXiv:1801.09536.
  • Baker et al. (2014) Simon Baker, Roi Reichart, and Anna Korhonen. 2014. An Unsupervised Model for Instance Level Subcategorization Acquisition. In Proceedings of EMNLP.
  • Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.
  • Bruni et al. (2012) Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional Semantics in Technicolor. In Proceedings of ACL.
  • Camacho-Collados et al. (2017) José Camacho-Collados, Mohammad Taher Pilehvar, Nigel Collier, and Roberto Navigli. 2017. SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity. In Proceedings of SemEval@ACL.
  • Cao et al. (2018) Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li. 2018. cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information. In Proceedings of AAAI.
  • Chen et al. (2015) Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huan-Bo Luan. 2015. Joint Learning of Character and Word Embeddings. In Proceedings of IJCAI.
  • Cimiano et al. (2005) Philipp Cimiano, Andreas Hotho, and Steffen Staab. 2005. Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis. J. Artif. Intell. Res., 24:305–339.
  • Faruqui et al. (2016) Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems With Evaluation of Word Embeddings Using Word Similarity Tasks. In Proceedings of RepEval@ACL.
  • Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing Search in Context: the Concept Revisited. In Proceedings of WWW.
  • Gerz et al. (2016) Daniela Gerz, Ivan Vulic, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity. In Proceedings of EMNLP.
  • Guo et al. (2014) Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning Sense-specific Word Embeddings By Exploiting Bilingual Resources. In Proceedings of COLING.
  • Halawi et al. (2012) Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale Learning of Word Relatedness with Constraints. In Proceedings of KDD.
  • Harris (1954) Zellig S. Harris. 1954. Distributional Structure. WORD, pages 146–162.
  • He et al. (2008) Xiaodong He, Mei Yang, Jianfeng Gao, Patrick Nguyen, and Robert Moore. 2008. Indirect-HMM-based Hypothesis Alignment for Combining Outputs from Machine Translation Systems. In Proceedings of EMNLP.
  • Hill et al. (2015) Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Computational Linguistics, 41:665–695.
  • Huang et al. (2012) Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of ACL.
  • Jin and Wu (2012) Peng Jin and Yunfang Wu. 2012. SemEval-2012 Task 4: Evaluating Chinese Word Similarity. In Proceedings of SemEval@NAACL-HLT.
  • Li et al. (2006) Mu Li, Muhua Zhu, Yang Zhang, and Ming Zhou. 2006. Exploring Distributional Similarity Based Models for Query Spelling Correction. In Proceedings of ACL.
  • Luong et al. (2013) Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better Word Representations with Recursive Neural Networks for Morphology. In Proceedings of CoNLL.
  • Marton et al. (2009) Yuval Marton, Chris Callison-Burch, and Philip Resnik. 2009. Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases. In Proceedings of EMNLP.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS.
  • Mitchell and Lapata (2010) Jeff Mitchell and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science, 34(8):1388–1429.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP.
  • Radinsky et al. (2011) Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A Word at a Time: Computing Word Relatedness Using Temporal Semantic Analysis. In Proceedings of WWW.
  • Rubenstein and Goodenough (1965) Herbert Rubenstein and John B. Goodenough. 1965. Contextual Correlates of Synonymy. Commun. ACM, 8:627–633.
  • Sag et al. (2002) Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of CICLing.
  • Wang et al. (2011) Xiang Wang, Yan Jia, Bin Zhou, Zhao Yun Ding, and Zheng Liang. 2011. Computing Semantic Relatedness Using Chinese Wikipedia Links and Taxonomy. Journal of Chinese Computer Systems, 32(11):2237–2242.
  • Wu and Li (2016) Yunfang Wu and Wei Li. 2016. Overview of the NLPCC-ICCPOL 2016 Shared Task: Chinese Word Similarity Measurement. In Proceedings of NLPCC/ICCPOL.
  • Yang and Powers (2006) Dongqiang Yang and David M. W. Powers. 2006. Verb Similarity on the Taxonomy of WordNet.