CA-EHN: Commonsense Word Analogy from E-HowNet

08/20/2019 · by Peng-Hsuan Li et al.

Word analogy tasks have tended to be handcrafted, involving permutations of hundreds of words with dozens of relations, mostly morphological relations and named entities. Here, we propose modeling commonsense knowledge down to word-level analogical reasoning. We present CA-EHN, the first commonsense word analogy dataset containing 85K analogies covering 5K words and 6K commonsense relations. This was compiled by leveraging E-HowNet, an ontology that annotates 88K Chinese words with their structured sense definitions and English translations. Experiments show that CA-EHN stands out as a great indicator of how well word representations embed commonsense structures, which is crucial for future end-to-end models to generalize inference beyond training corpora. The dataset is publicly available at <https://github.com/jacobvsdanniel/CA-EHN>.

1 Introduction

Commonsense reasoning is fundamental for natural language agents to generalize inference beyond their training corpora. Although the natural language inference (NLI) task [1, 2] has proved to be a good pre-training objective for sentence representations [3], its commonsense coverage is limited, and most models remain end-to-end, relying heavily on word representations to provide background world knowledge.

Therefore, we propose modeling commonsense knowledge down to word-level analogical reasoning. In this sense, existing analogy benchmarks are lackluster. For Chinese analogy (CA), the simplified Chinese dataset CA8 [4] and the traditional Chinese dataset CA-Google [5] translated from the English [6] contain only a few dozen relations, most of which are either morphological, e.g., a shared prefix, or about named entities, e.g., capital-country.

However, commonsense knowledge bases such as WordNet [7] and ConceptNet [8] have long annotated relations in our lexicon. Among them, E-HowNet [5], extended from HowNet [9], currently annotates 88K traditional Chinese words with their structured definitions and English translations.

In this paper, we propose an algorithm for the extraction of accurate commonsense analogies from E-HowNet. We present CA-EHN, the first commonsense analogy dataset containing 85,226 analogies covering 5,563 words and 6,490 commonsense relations.

2 E-HowNet

E-HowNet 2.0 consists of two major parts: A lexicon of words and concepts with multi-layered annotations, and a taxonomy of concepts with attached word senses.

2.1 Lexicon

The E-HowNet lexicon consists of two types of tokens: 88K words and 4K concepts. Words and concepts are distinguished by whether the token contains a vertical bar and an English string. For example, 人 (person) and 雞 (chicken) are words, while human人 and 雞chicken are concepts. In this work, the order of the English and Chinese parts within a concept does not matter. In addition, E-HowNet contains dozens of relations, which come fully in English, e.g., or, theme, telic.
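The word/concept test described above — a token is a concept if it contains a vertical bar and an English string — can be sketched as a small predicate. This is only a sketch, assuming concept tokens are written with a vertical bar between their English and Chinese parts (e.g. human|人); the real E-HowNet tooling may differ:

```python
import re

def is_concept(token: str) -> bool:
    """A token is a concept if it contains a vertical bar and an
    English (ASCII-letter) part; otherwise it is a plain word."""
    return "|" in token and re.search(r"[A-Za-z]", token) is not None
```

For example, `is_concept("human|人")` and `is_concept("雞|chicken")` are both true (the order of the two parts does not matter), while `is_concept("人")` is false.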

Words and concepts in E-HowNet are annotated with one or more structured definitions consisting of concepts and relations. Table 1 provides several examples of gradually increasing complexity: 人 (person) is defined simply as {human人}; 駿馬ExcellentSteed is defined as a 馬horse that has a qualification relation with HighQuality優質; 實驗室 (laboratory) is defined as an InstitutePlace場所 used for conducting experiments or research. Each concept has only one definition, but a word may have multiple senses and hence multiple definitions. In this work, we use E-HowNet word sense definitions to extract commonsense analogies (Section 3). In addition, word senses are annotated with their English translations, which could be used to transfer our extracted analogies to English multi-word expressions (MWEs).

2.2 Taxonomy

Concepts in E-HowNet are additionally organized into a taxonomy; Figure 1 shows a partially expanded view of the tree. Each word sense in the E-HowNet lexicon is attached to a taxon in the tree. In this work, we show that infusing the E-HowNet taxonomy into word embeddings boosts performance across benchmarks (Section 4.3).

Token               Type     Definition(s)
人 (person)          word     #1:{human人}
駿馬ExcellentSteed   concept  #1:{馬horse:qualification={HighQuality優質}}
實驗室 (laboratory)  word     #1:{InstitutePlace場所:telic={or({experiment實驗:location={~}},{research研究:location={~}})}}
Table 1: E-HowNet lexicon
Figure 1: E-HowNet taxonomy

3 Commonsense Analogy

We extract commonsense word analogies with rich coverage of words and relations by comparing word sense definitions. The extraction algorithm is further refined with multiple filters, including putting a linguist in the loop.

3.1 Analogical Word Pairs

As illustrated in Figure 2, the extraction algorithm before refinement consists of five steps.

Definition concept expansion. As many words are synonymous with some concepts, many word senses are defined trivially by one concept. For example, the definition of 駿馬 (excellent steed) is simply {駿馬ExcellentSteed}. The triviality is resolved by expanding such definitions by one layer, e.g., replacing {駿馬ExcellentSteed} with {馬horse:qualification={HighQuality優質}}, i.e., the definition of 駿馬ExcellentSteed.

Definition string parsing. We parse each definition into a directed graph. Each node in the graph is either a word, a concept, or a function relation, e.g., or() at the bottom of Table 1. Each edge is either an attribute relation edge, e.g., :telic=, or a dummy argument edge connecting a function node with one of its argument nodes.
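As a concrete illustration of this step, the nested brace syntax can be handled by a short recursive-descent routine. This is a minimal sketch; the dict encoding ("head", "args", "attrs") is our own choice for illustration, not E-HowNet's:

```python
def parse_def(s: str):
    """Parse an E-HowNet-style definition string such as
    {馬horse:qualification={HighQuality優質}} into a nested dict
    {"head": ..., "args": [...], "attrs": {...}}."""
    pos = 0

    def parse_node():
        nonlocal pos
        assert s[pos] == "{"
        pos += 1
        start = pos
        while s[pos] not in ":}(":   # head ends at relation, close, or call
            pos += 1
        node = {"head": s[start:pos], "args": [], "attrs": {}}
        if s[pos] == "(":            # function node, e.g. or({...},{...})
            pos += 1
            while s[pos] != ")":
                node["args"].append(parse_node())
                if s[pos] == ",":
                    pos += 1
            pos += 1                 # skip ")"
        while s[pos] == ":":         # attribute relation edges, e.g. :telic=
            pos += 1
            eq = s.index("=", pos)
            rel = s[pos:eq]
            pos = eq + 1
            node["attrs"][rel] = parse_node()
        assert s[pos] == "}"
        pos += 1
        return node

    return parse_node()
```

On the laboratory definition from Table 1, this yields an InstitutePlace場所 head whose telic attribute is an or() function node with two argument subgraphs.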

Definition graph comparison. For every sense pair of two different words in the E-HowNet lexicon, we determine whether their definition graphs differ in exactly one concept node. If they do, the two (word, concept) pairs are analogical to one another. For example, since the graph of 良材 sense#2 (good timber) and the expanded graph of 駿馬 sense#1 (excellent steed) differ only in wood木 and 馬horse, we extract the following concept analogy: 良材:wood木=駿馬:馬horse.
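The comparison can be sketched as a recursive walk over two graphs, here encoded as nested dicts with "head", "args" (function arguments), and "attrs" (relation edges) — a hypothetical encoding for illustration only:

```python
def one_node_diff(g1, g2):
    """Return the single (head1, head2) concept pair in which two
    definition graphs differ, or None if they differ structurally or
    in more than one node."""
    diffs = []
    same_shape = True

    def walk(a, b):
        nonlocal same_shape
        if set(a["attrs"]) != set(b["attrs"]) or len(a["args"]) != len(b["args"]):
            same_shape = False   # different relations or arity: not analogical
            return
        if a["head"] != b["head"]:
            diffs.append((a["head"], b["head"]))
        for x, y in zip(a["args"], b["args"]):
            walk(x, y)
        for rel in a["attrs"]:
            walk(a["attrs"][rel], b["attrs"][rel])

    walk(g1, g2)
    return diffs[0] if same_shape and len(diffs) == 1 else None
```

On graphs modeled after the 良材/駿馬 example (both a qualification of HighQuality優質, with heads wood木 vs. 馬horse), this returns the pair ("wood木", "馬horse"); identical graphs return None.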

Left concept expansion. For each concept analogy, we expand the left concept to those words that have one sense defined trivially by it. For example, there is only one word 木頭 (wood) defined as {wood木}. Thus after expansion, there is still only one analogy: 良材:木頭=駿馬:馬horse. Most of the time, this step yields multiple analogies per concept analogy.

Right concept expansion. Finally, the remaining concept in each analogy is again expanded to the list of words with a sense trivially defined by it. However, this time we do not use them to form multiple analogies. Instead, the word list is kept as a synset. For example, as 山馬 (orohippus), 馬 (horse), 馬匹 (horses), and 駙 (side horse) all have one sense defined as {馬horse}, the final analogy becomes 良材:木頭=駿馬:{山馬,馬,馬匹,駙}. When evaluating embeddings on our benchmark, we consider a prediction correct as long as it belongs to the synset.
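The two expansion steps can be sketched with a trivial-definition index over a hypothetical mini-lexicon (the words and defining concepts below are taken from the example above):

```python
from collections import defaultdict

# Hypothetical mini-lexicon: word -> list of definition strings
lexicon = {
    "木頭": ["{wood木}"],
    "山馬": ["{馬horse}"],
    "馬":   ["{馬horse}"],
    "馬匹": ["{馬horse}"],
    "駙":   ["{馬horse}"],
}

# Index each concept to the words trivially defined by it alone
trivial = defaultdict(list)
for word, defs in lexicon.items():
    for d in defs:
        if d.startswith("{") and d.endswith("}") and ":" not in d:
            trivial[d[1:-1]].append(word)

def expand(a, c1, b, c2):
    """Left expansion: one word analogy per word trivially defined by c1.
    Right expansion: keep the words trivially defined by c2 as a synset."""
    synset = set(trivial[c2])
    return [(a, w, b, synset) for w in trivial[c1]]
```

Applied to the concept analogy 良材:wood木=駿馬:馬horse, this produces the single word analogy 良材:木頭=駿馬:{山馬,馬,馬匹,駙}.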

3.2 Accurate Analogy

As the core procedure yields an excessively large benchmark, and E-HowNet word sense definitions are sometimes inaccurate, we made several refinements to the extraction process.

Concrete concepts. As we found that E-HowNet tends to provide more accurate definitions for more concrete concepts, we require words and concepts at every step of the process to be under physical物質, which is one layer below thing萬物 in Figure 1. This restriction shrinks the benchmark by half.

Common words. At every step of the process, we require words to occur at least five times in ASBC 4.0 [10], a segmented traditional Chinese corpus containing 10M words from articles between 1981 and 2007. This eliminates uncommon, ancient words or words with synonymous but uncommon, ancient characters. As shown in Table 3, the benchmark size is significantly reduced by this restriction.
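The frequency filter is straightforward; a sketch with a generic token list standing in for the segmented ASBC corpus:

```python
from collections import Counter

def filter_common(words, corpus_tokens, min_count=5):
    """Keep only words occurring at least min_count times in a
    segmented corpus (the paper uses ASBC 4.0 with a threshold of 5)."""
    freq = Counter(corpus_tokens)
    return [w for w in words if freq[w] >= min_count]
```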

Linguist checking. We added two data checks to the extraction process between definition graph comparison and left concept expansion. As shown in Table 3, each of the 36,100 concept analogies was checked by a linguist, leaving 24,439 accurate ones. Furthermore, each synset needed by the 24,439 concept analogies was checked again to remove words that are not actually synonymous with the defining concept. For example, 花草, 山茶花, 薰衣草, and 鳶尾花 are all common words with a sense defined trivially as {FlowerGrass花草}. However, the last three (camellia, lavender, iris) are not synonyms but hyponyms of the concept. This step also helps eliminate words in a synset that are used only in rare senses, as we do not expect embeddings to encode those senses without word sense disambiguation (WSD). After the second-pass linguist check, we arrived at 85,226 accurate analogies.

Figure 2: Commonsense analogy extraction
word : word = word : synset (defining concepts in parentheses)
滴答 (tick-tock) : 時鐘 (clock) = 鼕鼕 (rat-tat) : {…}  (時鐘clock → …drum)
聾子 (deaf person) : … (ear) = 瞎子 (blind person) : {…, 眸子, …, 眼眸, 眼睛}  (耳朵ear → …eye)
外公 (maternal grandfather) : 母親 (mother) = 祖父 (paternal grandfather) : {…, 父親, …, 爸爸, …, 爹爹, 老子}  (…mother → …father)
蝌蚪 (tadpole) : 青蛙 (frog) = 孑孓 (wriggler) : {斑蚊, …, 蚊子, 蚊蟲}  (青蛙frog → 蚊子mosquito)
Table 2: Commonsense analogy

4 Experiments

4.1 Analogy Datasets

Table 3 compares Chinese word analogy datasets. Most analogies in existing datasets involve morphological (morph.) or named entity (entity) relations. For example, CA8-Morphological [4] uses 21 shared prefix characters, e.g., 第, to form 2,553 analogies, e.g., 一 : 第一 = 二 : 第二 (one : first = two : second). As for named entities, some 20 word pairs of the capital-country relation can be permuted to form 190 analogies, which require a knowledge base but not commonsense to solve. Only the nature part of CA8 and the man-woman part of CA-Google [11] contain a handful of relations that require commonsense world knowledge. In contrast, CA-EHN provides 85K linguist-checked analogies covering 6,490 concept pairs, e.g., (wood木, 馬horse). Table 2 shows a small sample of the data, covering domains as diverse as onomatopoeia, disability, kinship, and zoology. The full CA-EHN is available in the supplementary materials.

4.2 Word Embeddings

We trained word embeddings using either GloVe [12] or SGNS [13] on a small or a large corpus. The small corpus consists of the traditional Chinese part of Chinese Gigaword [14] and ASBC 4.0 [10]. The large corpus additionally includes the Chinese part of Wikipedia.

Table 4 shows embedding performance across analogy benchmarks. Cov denotes the number of analogies whose first three words all exist in the embedding vocabulary; analogies that are not covered are excluded from the evaluation. We observe that the larger corpus yields higher accuracy across all benchmarks. In addition, using SGNS instead of GloVe consistently improves performance.
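The evaluation can be sketched as follows, assuming the standard 3CosAdd analogy objective [6] together with CA-EHN's synset criterion (the top-ranked candidate must belong to the gold synset); the function names and toy vectors are ours:

```python
import numpy as np

def rank_candidates(emb, a, b, c):
    """Rank vocabulary words by cos(b - a + c) (3CosAdd),
    excluding the three query words."""
    vocab = list(emb)
    M = np.stack([emb[w] for w in vocab])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    order = np.argsort(-(M @ q))
    return [vocab[i] for i in order if vocab[i] not in {a, b, c}]

def is_correct(emb, a, b, c, synset):
    """Correct iff the top prediction is any member of the gold synset."""
    return rank_candidates(emb, a, b, c)[0] in synset
```

Coverage then simply counts the analogies whose first three words are all keys of `emb`, and accuracy averages `is_correct` over the covered analogies.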

While performance on CA-EHN correlates well with performance on the other benchmarks, commonsense analogies prove much more difficult for distributed word representations than morphological or named entity analogies.

4.3 Commonsense Infusing

E-HowNet comes in two major parts: a lexicon and a taxonomy. We have used the lexicon to extract the CA-EHN commonsense analogies. For the taxonomy, we experiment with infusing its hypo-hyper and same-taxon relations into distributed word representations by retrofitting [15]. For example, in Figure 1, the word vector of 空間 is optimized to be close both to its distributed representation and to the word vectors of 空隙 (same-taxon) and 事物 (hypo-hyper). Table 4 shows that retrofitting embeddings with the E-HowNet taxonomy improves performance on most benchmarks, and all three embeddings have doubled accuracies on CA-EHN. This shows that CA-EHN is a great indicator of how well word representations embed commonsense knowledge.
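Retrofitting admits a compact sketch. The update rule below follows Faruqui et al. [15] with α = 1 for the original vector and β = 1/degree per neighbor; in our setting, the graph edges would be the taxonomy's hypo-hyper and same-taxon links:

```python
import numpy as np

def retrofit(emb, edges, iters=10):
    """Retrofit embeddings to a relation graph: each vector is pulled
    toward the average of its graph neighbors while staying close to
    its original (distributed) vector."""
    new = {w: v.copy() for w, v in emb.items()}
    nbrs = {w: [] for w in emb}
    for u, v in edges:
        if u in emb and v in emb:
            nbrs[u].append(v)
            nbrs[v].append(u)
    for _ in range(iters):
        for w, ns in nbrs.items():
            if not ns:
                continue
            # alpha = 1 for the original vector, beta = 1/deg per neighbor
            new[w] = (emb[w] + sum(new[n] for n in ns) / len(ns)) / 2.0
    return new
```

After a few iterations, linked words' vectors move closer together while each remains anchored to its distributed representation, which is what produces the gains in Table 4.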

Benchmark          Language         Type                                #analogies  #words  #relations  #concept ana.
CA8-Morphological  Simplified       reduplication A (morph.)            2,554       344     3
                                    reduplication AB (morph.)           2,535       423     3
                                    semi-prefix (morph.)                2,553       656     21
                                    semi-suffix (morph.)                2,535       727     41
CA8-Semantic       Simplified       geography (entity)                  3,192       305     9
                                    history (entity)                    1,465       177     4
                                    nature                              1,370       452     10
                                    people (entity)                     1,609       259     5
CA-Google          Trad. from Eng.  morph., entity, gender              11,126      498     14
CA-EHN             Traditional      physical                            2,027,133   17,595  29,343      238,250
                                    phy. + common                       210,050     6,818   8,801       36,100
                                    phy. + com. + linguist (1st pass)   145,400     6,350   6,536       24,439
                                    phy. + com. + linguist (2 passes)   85,226      5,563   6,490       24,359
Table 3: Analogy benchmarks
Embedding CA8-Morph. CA8-Semantic CA-Google CA-EHN
Algo. Corpus Words Cov Acc Cov Acc Cov Acc Cov Acc
GloVe Small 517,015 6,703 0.082 4,141 0.308 5,367 0.381 85,226 0.044
+retrofit 0.108 0.321 0.391 0.100
GloVe Large 1,004,750 7,110 0.112 5,619 0.370 8,409 0.437 85,226 0.062
+retrofit 0.133 0.370 0.418 0.128
SGNS Large 1,004,750 7,110 0.173 5,619 0.374 8,409 0.502 85,226 0.062
+retrofit 0.176 0.383 0.489 0.135
Table 4: Embedding performance

5 Conclusion

We have presented CA-EHN, the first commonsense word analogy dataset, built by leveraging word sense definitions in E-HowNet. After linguist checking, it contains 85,226 Chinese analogies covering 5,563 words and 6,490 commonsense relations. We anticipate that CA-EHN will become an important benchmark for testing how well future embedding methods capture commonsense knowledge, which is crucial for models to generalize inference beyond their training corpora. With the translations provided by E-HowNet, the Chinese words in CA-EHN can be transferred to English MWEs.

References

  • [1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
  • [2] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
  • [3] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
  • [4] Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, and Xiaoyong Du. Analogical reasoning on Chinese morphological and semantic relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018.
  • [5] Wei-Yun Ma and Yueh-Yin Shih. Extended HowNet 2.0 – an entity-relation common-sense representation model. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
  • [6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • [7] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.
  • [8] Robyn Speer and Catherine Havasi. Representing general relational knowledge in ConceptNet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), 2012.
  • [9] Zhendong Dong and Qiang Dong. HowNet – a hybrid language and knowledge resource. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering, 2003.
  • [10] Wei-Yun Ma, Yu-Ming Hsieh, Chang-Hua Yang, and Keh-Jiann Chen. Design of management system for Chinese corpus construction. In Proceedings of Research on Computational Linguistics Conference XIV, 2001.
  • [11] Chi-Yen Chen and Wei-Yun Ma. Word embedding evaluation datasets and Wikipedia title embedding for Chinese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
  • [12] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • [13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, 2013.
  • [14] David Graff and Ke Chen. Chinese Gigaword LDC2003T09. Linguistic Data Consortium, 2003.
  • [15] Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.