COS960
COS960: A Chinese Word Similarity Dataset of 960 Word Pairs
Word similarity computation is a widely recognized task in lexical semantics. Most existing datasets test the similarity of single-morpheme word pairs, while few focus on words of two or more morphemes. In this work, we propose COS960, a benchmark dataset of 960 pairs for Chinese wOrd Similarity, in which all words have two morphemes, cover three part-of-speech (POS) tags, and are annotated by humans for similarity rather than relatedness. We give a detailed description of the dataset construction and annotation process, and test a range of word embedding models on it. The dataset can be obtained from https://github.com/thunlp/COS960.
Word similarity computation is the task of automatically computing a similarity score for given word pairs, and it is the most popular way to evaluate the quality of word embeddings (Faruqui et al., 2016). The task evaluates the correlation between model-computed similarities and human judgement: the higher the correlation, the more semantic information is captured by the model (Bakarov, 2018).
A large number of diverse datasets have been constructed to evaluate word similarity, most of them in English. Rubenstein and Goodenough (1965) made an attempt to compute word similarities in order to test the distributional hypothesis (Harris, 1954) and constructed the first dataset, RG65, a list of 65 noun pairs with human-annotated similarity scores in the range 0-4. After that, a series of similarity datasets appeared, each with its own characteristics, including:
(1) focusing on word relatedness: WordSim-353 (Finkelstein et al., 2001), YP-130 (Yang and Powers, 2006), MEN (Bruni et al., 2012), MTurk-287 (Radinsky et al., 2011), MTurk-771 (Halawi et al., 2012);
(2) focusing on true word similarity: SimLex-999 (Hill et al., 2015), SimVerb-3500 (Gerz et al., 2016), Verb-143 (Baker et al., 2014);
(3) in Chinese: WordSim-297 (Jin and Wu, 2012), WordSim-240 (Wang et al., 2011), polysemous words (Guo et al., 2014), PKU-500 (Wu and Li, 2016);
(4) other highlights: two-word phrasal similarity (Mitchell and Lapata, 2010), rare words (Luong et al., 2013), words in sentential context (Huang et al., 2012), cross-lingual word similarity (Camacho-Collados et al., 2017), etc.
In English, there are a number of datasets focusing on true word similarity, which has wide applications in dictionary generation (Cimiano et al., 2005), machine translation (He et al., 2008; Marton et al., 2009) and language correction (Li et al., 2006). However, a dataset focusing on true word similarity has long been absent in Chinese. In addition, most datasets consist of single-word pairs; few of them consider the similarity of Multiword Expressions (MWEs), which are considered a "pain in the neck" (Sag et al., 2002) for natural language processing (NLP).
Therefore, we introduce COS960, a Chinese word similarity dataset of 960 word pairs, where each word is actually a two-word MWE. Each word pair is annotated by 15 native speakers according to its true similarity rather than association. We also report the performance of a variety of word embedding methods on COS960. We hope COS960 will be helpful to the NLP community.
To ensure that our two-morpheme words are real Chinese words, we use HowNet, a well-known linguistic knowledge base, as the source of words. We extract every word for which the word itself and both of its morphemes appear in HowNet, which yields a dataset of 51,034 such triples.
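The extraction step can be illustrated with a minimal sketch. It assumes that each morpheme corresponds to a single Chinese character, so that a two-morpheme word is a two-character string; the function name `extract_triples` and the toy vocabulary are hypothetical and stand in for the real HowNet word list.

```python
def extract_triples(vocabulary):
    """Return (word, first_morpheme, second_morpheme) triples.

    Assumption (not stated in the paper): a morpheme is one character,
    so only two-character words are candidates.
    """
    vocab = set(vocabulary)
    triples = []
    for word in sorted(vocab):
        if len(word) != 2:
            continue
        first, second = word[0], word[1]
        # Keep the word only if both of its morphemes are also entries.
        if first in vocab and second in vocab:
            triples.append((word, first, second))
    return triples


# Toy vocabulary standing in for the HowNet word list (hypothetical data).
toy_vocab = ["电", "脑", "电脑", "手", "手机", "苹果"]
triples = extract_triples(toy_vocab)
```

In this toy example only 电脑 is kept, because it is the only two-character word whose two characters are themselves vocabulary entries.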
We then split the dataset into four parts based on the POS tags of the words: noun, verb, adjective and other. We use the POS tags annotated in HowNet and filter out words that have more than one POS tag or no POS tag. The final number of each set is correspondingly. Here we only use the noun, verb and adjective sets.
Within each of the three sets, we pair up the words with one another. We then calculate the cosine similarity of each pair based on word embeddings learned by GloVe (Pennington et al., 2014) on the Sogou-T corpus (a corpus of web pages containing 2.7 billion words; https://www.sogou.com/labs/resource/t.php), with 200-dimensional word vectors.
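As an illustration of this pairing step, the following sketch computes the cosine similarity of every unordered word pair in a set. The helper names and the tiny three-dimensional vectors are assumptions made for the example; the paper uses 200-dimensional GloVe vectors.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Map every unordered word pair to its cosine similarity."""
    return {
        (w1, w2): cosine_similarity(embeddings[w1], embeddings[w2])
        for w1, w2 in combinations(sorted(embeddings), 2)
    }


# Hypothetical toy embeddings (not real GloVe vectors).
toy_embeddings = {
    "a": [1.0, 0.0, 0.0],
    "b": [1.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}
similarities = pairwise_similarities(toy_embeddings)
```

Note that the number of pairs grows quadratically with the size of the word set, so a real implementation over tens of thousands of words would likely use vectorized matrix operations instead.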
We further divide the word pairs of each of the three POS tags into five parts according to their cosine similarity range. Note that we do not take word pairs with cosine similarity lower than 0.4 into account, because almost all of them are not really similar to each other. The number of word pairs in each set is shown in Table 1. Finally, we obtain 480 noun pairs, 240 verb pairs and 240 adjective pairs.
Table 1: Number of word pairs in each similarity range (one row per range) and POS tag.
 | noun-noun | verb-verb | adjective-adjective
 | 96 | 48 | 48
 | 96 | 48 | 48
 | 96 | 48 | 48
 | 96 | 48 | 48
 | 96 | 48 | 48
total | 480 | 240 | 240
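The stratified selection summarized in Table 1 could be sketched as follows. Two things here are assumptions rather than facts from the paper: the bin edges in `BIN_EDGES` are hypothetical (only the 0.4 lower bound and the number of ranges, five, are given), and drawing pairs uniformly at random within each range is one plausible way to reach equal counts per range.

```python
import random
from bisect import bisect_right

# Hypothetical bin edges: only the 0.4 lower bound comes from the text.
BIN_EDGES = [0.4, 0.5, 0.6, 0.7, 0.8, 1.0]


def bin_index(similarity, edges=BIN_EDGES):
    """Index of the range containing `similarity`, or None if outside."""
    if similarity < edges[0] or similarity > edges[-1]:
        return None
    # The last range is closed on the right so that 1.0 is included.
    return min(bisect_right(edges, similarity) - 1, len(edges) - 2)


def sample_per_bin(scored_pairs, per_bin, seed=0):
    """Pick up to `per_bin` pairs from each similarity range."""
    rng = random.Random(seed)
    bins = [[] for _ in range(len(BIN_EDGES) - 1)]
    for pair, similarity in scored_pairs:
        idx = bin_index(similarity)
        if idx is not None:
            bins[idx].append(pair)
    return [rng.sample(b, min(per_bin, len(b))) for b in bins]
```

With `per_bin=96` for nouns and `per_bin=48` for verbs and adjectives, this would reproduce the counts in Table 1, provided each range contains enough candidate pairs.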
The 960 pairs are randomly shuffled and divided into two parts of 480 pairs each. We recruit 30 university students who are native speakers, and each of them is asked to annotate 480 word pairs. Annotators are shown the definition of each word and its categories in TongYiCiCiLin as a reference, and are asked to give each word pair a similarity score in the range 0-4.
Before formal annotation, annotators are asked to read the Annotation Guidebook, which explains the difference between similarity and relatedness. To improve annotation quality, they are required to take an exam before annotating COS960, which consists of at least two word pairs for each POS tag and similarity level (35 in total).
During annotation, annotators are welcome to discuss and raise questions whenever they hesitate, which helps to improve the consistency of different annotations and the overall annotation quality.
We calculate Krippendorff's alpha between every two annotators, and all of their annotations are accepted. Finally, we use the average score of each pair as its final similarity score to form COS960.
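A minimal sketch of these two computations is given below. It assumes the interval-level form of Krippendorff's alpha and complete data (every rater scores every pair); the paper does not state which distance metric was used, and the function names are illustrative.

```python
from itertools import permutations
from statistics import mean


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha with the squared-difference (interval) metric.

    `ratings` is a list of units (word pairs); each unit is a list with
    one score per rater. Assumes no missing scores and >= 2 raters.
    """
    m = len(ratings[0])
    # Observed disagreement: mean squared difference within each unit.
    d_o = mean(
        sum((a - b) ** 2 for a, b in permutations(unit, 2)) / (m * (m - 1))
        for unit in ratings
    )
    # Expected disagreement: mean squared difference over all pooled scores.
    pooled = [score for unit in ratings for score in unit]
    n = len(pooled)
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # all scores identical; treated as full agreement here
    return 1.0 - d_o / d_e


def final_scores(ratings):
    """Final similarity score of each pair: the mean over its raters."""
    return [mean(unit) for unit in ratings]
```

For the pairwise setting described above, each unit would hold the two scores given by a pair of annotators. The pooled loop is quadratic in the number of scores, which is acceptable for two raters over 480 pairs but slow for much larger inputs.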
In this section, we provide experimental results of several existing word embedding models on our COS960 dataset.
For evaluation on COS960, we choose several typical word embedding models: (1) Skip-Gram (Mikolov et al., 2013); (2) continuous bag-of-words (CBOW) (Mikolov et al., 2013); (3) GloVe (Pennington et al., 2014); (4) CWE, a character-enhanced word embedding (Chen et al., 2015); (5) fasttext, word vectors enriched with subword information (Bojanowski et al., 2016); (6) cw2vec, a Chinese embedding with stroke n-gram information (Cao et al., 2018). For hyper-parameters, we set the number of training epochs of every model to 5 and keep the other default parameters of each model.
As the evaluation protocol, we calculate three scores between the cosine similarities of word pairs computed from each model's word embeddings and the human-annotated scores: the Pearson correlation coefficient, Spearman's rank correlation coefficient, and the square root of the product of the two (Square-Mul).
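The three scores can be sketched in plain Python as follows. The function names are illustrative, tied values receive average ranks (a common convention that the paper does not specify), and `square_mul` assumes that both correlations are positive, as they are in all reported results.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def square_mul(x, y):
    """Square root of Pearson times Spearman (assumes both are positive)."""
    return math.sqrt(pearson(x, y) * spearman(x, y))


# Hypothetical toy data: model cosine similarities vs. human scores (0-4).
model_scores = [0.1, 0.4, 0.5, 0.9]
human_scores = [0.0, 1.0, 2.0, 4.0]
```

Because Square-Mul is a geometric mean, it always lies between the Pearson and Spearman values, which matches the pattern in the result tables.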
The overall evaluation results on COS960 are shown in Table 2. From the table, we observe that:
Table 2: Overall evaluation results on COS960 (correlation × 100).
Model | Spearman's | Pearson | Square-Mul
Skip-Gram | 76.2 | 71.0 | 73.6
CBOW | 78.2 | 72.1 | 75.1
GloVe | 75.0 | 72.0 | 73.5
CWE | 72.1 | 65.9 | 69.0
cw2vec | 75.4 | 68.1 | 71.7
fasttext | 75.5 | 70.0 | 72.7
(1) CBOW achieves the best performance, outperforming the second-best model by 2.1% on average.
(2) All six methods obtain considerably high correlation scores under all three evaluation protocols. This indicates that the cosine similarity of the six evaluated word embeddings still correlates well with true word similarity, which contradicts Hill et al. (2015).
(3) All six methods achieve their highest score under Spearman's rank correlation. We attribute this to high annotation consistency, in that there is often more than one word pair at each similarity level.
Table 3: Evaluation results on noun pairs (correlation × 100).
Model | Spearman's | Pearson | Square-Mul
Skip-Gram | 74.5 | 66.8 | 70.5
CBOW | 77.0 | 69.7 | 73.2
GloVe | 73.7 | 68.6 | 71.1
CWE | 74.2 | 64.2 | 69.0
cw2vec | 73.7 | 64.8 | 69.1
fasttext | 74.9 | 66.4 | 70.5
We further present the performance of the models on COS960 for each of the three POS tags, i.e. nouns in Table 3, verbs in Table 4 and adjectives in Table 5.
Table 4: Evaluation results on verb pairs (correlation × 100).
Model | Spearman's | Pearson | Square-Mul
Skip-Gram | 83.2 | 81.1 | 82.1
CBOW | 84.8 | 80.7 | 82.7
GloVe | 78.5 | 78.1 | 78.3
CWE | 78.1 | 76.6 | 77.3
cw2vec | 82.5 | 78.1 | 80.3
fasttext | 82.9 | 80.5 | 81.7
Table 5: Evaluation results on adjective pairs (correlation × 100).
Model | Spearman's | Pearson | Square-Mul
Skip-Gram | 80.0 | 77.0 | 78.5
CBOW | 78.5 | 74.4 | 76.4
GloVe | 77.7 | 76.8 | 77.1
CWE | 71.6 | 67.9 | 69.8
cw2vec | 77.0 | 70.5 | 73.7
fasttext | 78.8 | 76.1 | 77.5
(1) CBOW still performs best on nouns and verbs, which is consistent with the overall results.
(2) On average, the models perform best on verb pairs and worst on noun pairs.
In this paper, we propose COS960, a Chinese word similarity dataset of 960 word pairs, where all selected words are MWEs with two component words. We also describe the dataset construction process in detail and evaluate existing word embedding models on it. We hope this dataset will contribute to the development of distributional semantics in Chinese.
Hill et al. (2015). SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Computational Linguistics, 41:665–695.
Luong et al. (2013). Better Word Representations with Recursive Neural Networks for Morphology. In Proceedings of CoNLL.