In linguistics, sememes are defined as the minimum semantic units of human languages bloomfield1926set. Some linguists believe that the meanings of all the words can be described with a limited closed set of sememes, which is similar to the idea of semantic prime wierzbicka1996semantics. However, sememes are normally implicit and cannot be recognized directly. Therefore, people manually annotate words with a set of predefined sememes to construct sememe KBs.
HowNet dong2003HowNet is the most famous sememe KB. It defines about language-independent sememes and uses them to annotate over 100 thousand Chinese and English words. Every word in HowNet has one or more senses, and each sense is annotated with a hierarchical structure of sememes, i.e., a sememe tree. Figure 1 shows an example of how words are annotated in HowNet. The word “observatory” has one sense, and this sense has 4 sememes, namely “location”, “investigate”, “celestial body” and “celestial event”. HowNet has been successfully utilized in various NLP applications such as semantic similarity computation liu2002, word sense disambiguation zhang2005chinese; duan2007word, sentiment analysis zhu2006semantic; xianghua2013multi; dang2010method, language modeling gu2018language, word embedding niu2017improved, and word classification zeng2018chinese.
Because new words emerge constantly and the meanings of existing words keep changing, manually expanding and updating sememe KBs like HowNet is labor-intensive and time-consuming. To solve this problem, xie2017lexical propose the task of sememe prediction, which aims to automatically recommend related sememes for words without sememe annotation. They also present two simple but effective embedding-based models for the task. These two models ignore the hierarchical structure of sememe trees and predict a set of sememes for every word. jin2018incorporating further propose to incorporate the Chinese characters of words into sememe prediction and improve the prediction performance.
These methods rely heavily on the representations of words or characters. As a result, it is hard to predict proper sememes for low-frequency words or words with low-frequency characters, because their embeddings are poor. In fact, there are other available resources that can be used in sememe prediction to tackle this challenge. Dictionary definitions clearly explain the meanings of words, including low-frequency words, and they are also easy to obtain. Thus, we believe dictionary definitions are appropriate for sememe prediction. li2018sememe are the first to use dictionary definitions in sememe prediction. However, they adopt a sequence-to-sequence model that imposes an inappropriate order on sememes. Additionally, they regard definitions as sequences of characters, which can hurt the representations of the definitions owing to the ambiguity of Chinese characters.
In this paper, we propose a novel model for sememe prediction using dictionary definitions. The model not only addresses the issues of existing methods but also can capture local semantic correspondence, a kind of particular matching relationship between sememe trees and definitions. More explicitly, for a given word, we find each of its sememes can be matched to one or more of its definition words, i.e., words in the dictionary definition. Taking Figure 1 for example, the sememes “location”, “investigate”, and “celestial body, celestial event” are semantically matched to the words “institution”, “observing, researching”, and “celestial, astronomy” respectively.
The reason for local semantic correspondence is easy to explain. Both a word’s sememe tree and dictionary definition express the same meaning. The sememe tree of “observatory” represents the meaning of “location of investigating celestial events and celestial bodies”, which is almost identical to that of the dictionary definition. Furthermore, sememe trees and dictionary definitions can be decomposed into sememes and definition words respectively. Consequently, it is natural to see that there is local semantic correspondence between a word’s sememes and definition words.
To take advantage of local semantic correspondence, we propose the Sememe Correspondence Pooling (SCorP) model. SCorP computes the correspondence scores between all sememes and the words in a definition, and does max-pooling over the correspondence scores for every sememe. Moreover, following previous works xie2017lexical; jin2018incorporating; li2018sememe, SCorP ignores the hierarchical structures of sememes in sememe prediction, and it uses a sequence-to-set multi-label classification framework to avoid imposing an order on sememes. The inputs to SCorP are word sequences rather than character sequences, which remedies the ambiguity problem of characters and, at the same time, makes it possible to use the sememe information of the definition words. In addition, we propose two effective retrofitting operations for SCorP that improve the prediction performance. In our experiments, we evaluate the sememe prediction performance of our SCorP model and several baselines. Experimental results show that our model achieves state-of-the-art performance. We also conduct further quantitative analysis and find that our model can correctly match definition words to sememes, which explains the superiority of our model.
To conclude, the contributions of this paper are twofold: (1) we discover local semantic correspondence, a kind of specific semantic matching between a word’s sememe tree and definition; (2) we propose a novel model which utilizes local semantic correspondence to predict sememes, and it achieves state-of-the-art performance in sememe prediction.
2 Related Work
Applications of Sememes
Sememes have been widely used in NLP tasks, such as semantic similarity computation liu2002, word sense disambiguation zhang2005chinese; duan2007word, and sentiment analysis xianghua2013multi; dang2010method; zhu2006semantic; huang2014new. In addition, niu2017improved incorporate sememes into word representation learning and utilize sememes to capture the exact meaning of a word within specific contexts. zeng2018chinese use sememe knowledge with the attention mechanism to determine the type of a word and expand the Chinese LIWC lexicon pennebakerlinguistic. gu2018language propose to regard sememes as linguistic experts to help predict suitable words in language modeling.
xie2017lexical present the automatic sememe prediction task for the first time. They also propose a collaborative filtering-based model SPWE and a matrix factorization-based model SPSE for this task and achieve acceptable sememe prediction results. Following this work, jin2018incorporating propose two models SPWCF and SPCSE which can leverage the internal character information of Chinese words, as well as an ensemble model CSP which considers both internal and external information. li2018sememe first explore the role of dictionary descriptions in sememe prediction. They propose a sequence to sequence model named LD+Seq2Seq, where the inputs are character sequences of dictionary definitions or wiki descriptions, and the outputs are sequences of predicted sememes. In addition, there are also works trying to build sememe KB for another language by cross-lingual sememe prediction qi2018cross.
Applications of Dictionary Definitions
Dictionary definitions are abundant and valuable research resources in NLP. They have been used in various tasks, such as word sense disambiguation luo2018incorporating, knowledge representation learning xie2016representation; zhong2015aligning, reading comprehension long2017world, and reverse dictionary hill2016learning; bosc2018auto. Most of the previous works encode definitions into vectors for downstream tasks. In addition, some works adopt graph-based methods to build a word graph using dictionary definitions for downstream tasks thorat2016implementing; tissier2017dict2vec.
In this section, we first introduce some terms and notations which will be used below. Then we give a brief description of a basic sequence-to-set multi-label classification framework, on which our SCorP model is based. Next, we present our SCorP model and two different retrofitting operations in detail. Finally, we succinctly describe an ensemble model.
3.1 Terms and Notations
As mentioned previously, we use “definition word” to refer to a word in a dictionary definition, and “target word” to refer to the word for which we want to predict sememes. We define $W$ as the vocabulary set and $S$ as the set of sememes. Given a word $w \in W$, all of its sememes form a set $S_w = \{s_1, \dots, s_{|S_w|}\} \subset S$, where $|\cdot|$ represents the cardinality of a set. The dictionary definition of word $w$ is denoted by $D_w = (d_1, \dots, d_{|D_w|})$, where $d_i$ is its $i$-th definition word. We use lowercase boldface symbols for vectors and uppercase boldface symbols for matrices. For example, $\mathbf{w}$ is the word vector of $w$, $\mathbf{s}$ is the sememe vector of $s$, and $\mathbf{D}$ is the matrix composed of the word vectors $\mathbf{d}_1, \dots, \mathbf{d}_{|D_w|}$.
3.2 Basic Multi-label Classification Framework
The sequence to set multi-label classification (MC) framework serves as the base of our SCorP model. It has two main parts, the encoder which can encode a dictionary definition into a vector, and the multi-label classifier, which uses the definition vector to compute association scores for each sememe. And the sememes with high scores are selected as the predicted sememes.
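We choose Bidirectional LSTM (BiLSTM) schuster1997bidirectional as the encoder. Formally, for the dictionary definition $D_w = (d_1, \dots, d_{|D_w|})$ of a target word $w$, we pass $\mathbf{d}_1, \dots, \mathbf{d}_{|D_w|}$, the pre-trained word embeddings of the definition words, to the BiLSTM. The BiLSTM then outputs two sequences of hidden states:
$$\{\overrightarrow{\mathbf{h}}_1, \dots, \overrightarrow{\mathbf{h}}_{|D_w|}\},\ \{\overleftarrow{\mathbf{h}}_1, \dots, \overleftarrow{\mathbf{h}}_{|D_w|}\} = \mathrm{BiLSTM}(\mathbf{d}_1, \dots, \mathbf{d}_{|D_w|}). \tag{1}$$
We use the concatenation of the last hidden states of both directions as the definition vector, denoted by $\mathbf{v}$, and feed it to a fully connected layer:
$$\mathbf{v} = [\overrightarrow{\mathbf{h}}_{|D_w|}; \overleftarrow{\mathbf{h}}_1], \qquad \mathbf{a} = \mathbf{W}\mathbf{v} + \mathbf{b},$$
where $\mathbf{W} \in \mathbb{R}^{|S| \times 2l}$, $\mathbf{b} \in \mathbb{R}^{|S|}$, and $l$ represents the dimension of hidden states in a single direction. $a_i$, the $i$-th element of $\mathbf{a}$, denotes the association score of the $i$-th sememe. For training, we use the multi-label one-versus-all cross-entropy loss:
$$\mathcal{L} = -\sum_{i=1}^{|S|} \big[\, y_i \log \sigma(a_i) + (1 - y_i) \log (1 - \sigma(a_i)) \,\big],$$
where $\sigma$ is the sigmoid function and $y_i \in \{0, 1\}$ represents whether the $i$-th sememe is in the sememe set $S_w$ of word $w$.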
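The scoring and loss steps of the MC framework can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy numbers are hypothetical, the definition vector is taken as given rather than produced by a BiLSTM, and the loss is summed over sememes (the normalization used in the paper is not stated here).

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def association_scores(weights, bias, definition_vector):
    """Fully connected layer: one association score per sememe (one row each)."""
    return [sum(w * v for w, v in zip(row, definition_vector)) + b
            for row, b in zip(weights, bias)]

def one_vs_all_cross_entropy(scores, gold):
    """Sum of independent binary cross-entropy terms, one per sememe.

    gold holds 0/1 labels. -log(sigmoid(a)) equals softplus(-a), and
    -log(1 - sigmoid(a)) equals softplus(a).
    """
    return sum(softplus(-a) if y else softplus(a) for a, y in zip(scores, gold))

# Toy example (made-up values): 3 sememes, a 2-dimensional definition vector.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
v = [2.0, -1.0]
scores = association_scores(W, b, v)
loss = one_vs_all_cross_entropy(scores, [1, 0, 0])
```

Because each sememe gets its own sigmoid term, the loss places no order on the sememes, which is the property the sequence-to-set formulation is meant to provide.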
3.3 SCorP Model
The basic MC model encodes the whole definition into a vector, which will cause information loss and especially, make it difficult to cope with long definitions. Our Sememe Correspondence Pooling (SCorP) model, which exploits local semantic correspondence, can handle this problem.
A sememe prediction example of the SCorP model is illustrated in Figure 2. The encoder of the SCorP model is similar to that of the MC model. However, the SCorP model uses the concatenations of all the bidirectional hidden states in Equation 1, rather than only the concatenation of the last bidirectional hidden states:
$$\mathbf{h}_j = [\overrightarrow{\mathbf{h}}_j; \overleftarrow{\mathbf{h}}_j], \qquad j = 1, \dots, |D_w|.$$
Here $\mathbf{h}_j$ contains the contextual semantic information of $d_j$, the $j$-th definition word. Then every $\mathbf{h}_j$ is passed to a fully connected layer, and the outputs constitute a matrix:
$$\mathbf{c}_j = \mathbf{W}_c \mathbf{h}_j + \mathbf{b}_c, \qquad \mathbf{C} = [\mathbf{c}_1, \dots, \mathbf{c}_{|D_w|}] \in \mathbb{R}^{|S| \times |D_w|},$$
where $\mathbf{C}$ is in fact the semantic correspondence matrix, and its element $c_{ij}$ measures the correspondence between the $i$-th sememe and the $j$-th definition word.
According to local semantic correspondence, each sememe of a word strongly corresponds to one or more of its definition words. In other words, whether a sememe is selected as a predicted result depends on whether there is strong correspondence between the sememe and some words in the definition. Inspired by this, the SCorP model calculates the final association score of every sememe by doing max-pooling over the correspondence scores between the sememe and all the definition words:
$$a_i = \max_{1 \le j \le |D_w|} c_{ij}.$$
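The pooling step can be illustrated with a small sketch. It assumes the correspondence matrix has already been computed and is stored as a list of rows, one row per sememe and one column per definition word; the function name and all numbers are hypothetical. Besides the pooled score, the sketch also returns the definition word that produced it, which is the word the sememe is matched to.

```python
def max_pool_correspondence(corr, sememes, words):
    """Max-pool each sememe's row of correspondence scores.

    corr[i][j] is the correspondence of sememe i with definition word j.
    Returns {sememe: (pooled score, best-matching definition word)}.
    """
    results = {}
    for sememe, row in zip(sememes, corr):
        best_j = max(range(len(row)), key=row.__getitem__)
        results[sememe] = (row[best_j], words[best_j])
    return results

# Illustrative values only (not taken from the paper).
sememes = ["location", "investigate", "celestial body"]
words = ["observe", "celestial", "institution"]
corr = [
    [-1.0, -2.0, 2.0],   # location
    [2.5, 0.5, -1.5],    # investigate
    [-1.0, 6.0, -2.0],   # celestial body
]
pooled = max_pool_correspondence(corr, sememes, words)
```

A sememe therefore scores highly as soon as a single definition word supports it, regardless of how many unrelated words the definition contains.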
3.4 Retrofitting Operations
Incorporating Sememes of Definition Words
In this subsection, we propose to incorporate sememe information of definition words into the encoder. On the one hand, sememe information has been proven effective in improving word embeddings niu2017improved. On the other hand, we find that if a sememe corresponds to a definition word, it usually belongs to the sememe set of the definition word. Still taking the target word “observatory” for example, its sememe “investigate” semantically corresponds to the definition word “observing” and at the same time, “investigate” is also annotated to “observing” in HowNet. Therefore, the sememes of definition words should be helpful to sememe prediction.
We use a simple but effective method to incorporate sememe information of definition words, i.e., adding the average of the sememe embeddings to the embedding of the definition word:
$$\tilde{\mathbf{d}}_i = \mathbf{d}_i + \frac{1}{|S_{d_i}|} \sum_{s \in S_{d_i}} \mathbf{s},$$
where $S_{d_i}$ is the sememe set of $d_i$ and $\tilde{\mathbf{d}}_i$ is the retrofitted embedding of $d_i$. For the definition words without sememe annotation, their retrofitted embeddings are equal to their original word embeddings. For simplicity, we use “+SE” to indicate incorporating sememes of definition words.
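The +SE operation amounts to one vector addition per definition word. The sketch below uses plain Python lists as embeddings and assumes word and sememe embeddings share the same dimension; the function name and the toy vectors are hypothetical.

```python
def retrofit_with_sememes(word_vec, sememe_vecs):
    """Add the mean sememe embedding to a definition word's embedding.

    A word without sememe annotation (empty sememe_vecs) keeps its
    original embedding.
    """
    if not sememe_vecs:
        return list(word_vec)
    n = len(sememe_vecs)
    mean = [sum(column) / n for column in zip(*sememe_vecs)]
    return [w + m for w, m in zip(word_vec, mean)]

# Toy 2-dimensional embeddings (made-up values).
retrofitted = retrofit_with_sememes([1.0, 2.0], [[0.0, 2.0], [2.0, 0.0]])
unannotated = retrofit_with_sememes([1.0, 2.0], [])
```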
Integrating Embeddings of Target Words
In this subsection, we present an approach to integrating embeddings of target words into our model. If a target word has a word embedding, we simply put it and a separator in front of its dictionary definition to obtain the retrofitted definition sequence. We choose the colon (:) as the separator. Formally, for a given target word $w$ with definition $D_w = (d_1, \dots, d_{|D_w|})$, the retrofitted definition sequence is $(w, :, d_1, \dots, d_{|D_w|})$. Then we feed the embeddings of the elements of the retrofitted definition sequence to the BiLSTM encoder.
As for the target words that have no embedding, we borrow the idea of byte pair encoding from sequence modeling sennrich2016neural to deal with them. For a target word without embeddings, we first split it into smaller subwords or characters which have embeddings. Then we add these subwords or characters and a separator to the definition sequence of the target word. We use “+TW” to indicate integrating embeddings of target words.
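The construction of the retrofitted sequence, including the fallback for target words without embeddings, might look as follows. The text does not specify how a word is split, so this sketch substitutes a greedy longest-match segmentation against the embedding vocabulary, which is a simplification and not byte pair encoding itself. Skipping characters that have no embedding is likewise an assumption, and the function names are hypothetical.

```python
def split_into_known_pieces(word, vocab):
    """Greedy longest-match split of word into pieces that have embeddings.

    Assumption: a character that matches nothing in vocab is skipped.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            i += 1  # no piece starts here; skip this character
    return pieces

def retrofitted_sequence(target, definition, vocab, sep=":"):
    """Prepend the target word (or its pieces) and a separator to the definition."""
    if target in vocab:
        head = [target]
    else:
        head = split_into_known_pieces(target, vocab)
    return head + [sep] + list(definition)

# Hypothetical vocabulary of items that have embeddings.
vocab = {"observatory", "star", "gazer"}
in_vocab = retrofitted_sequence("observatory", ["institution"], vocab)
oov = retrofitted_sequence("stargazer", ["one", "who", "observes", "stars"], vocab)
```

If no piece of an OOV target word has an embedding, the sequence starts with the separator alone and the model falls back to the definition.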
3.5 Ensemble Model
Our SCorP model utilizes dictionary definitions and word embeddings, while previous CSP model jin2018incorporating mainly uses internal character information of target words. The two models use diverse information and naturally, we combine the two models to build an ensemble model. This model scores a sememe by doing a weighted addition of sememe scores outputted by the two submodels.
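The weighted addition can be sketched as below. The two weights stand in for the weight ratio tuned on the validation set; their values here, the treatment of a sememe missing from one submodel as having score 0, and the function name are all assumptions made for illustration.

```python
def ensemble_scores(scorp_scores, csp_scores, scorp_weight, csp_weight):
    """Weighted sum of the sememe scores produced by the two submodels.

    Assumption: a sememe absent from one submodel's output contributes 0.
    """
    sememes = set(scorp_scores) | set(csp_scores)
    return {
        s: scorp_weight * scorp_scores.get(s, 0.0)
           + csp_weight * csp_scores.get(s, 0.0)
        for s in sememes
    }

# Made-up scores and a hypothetical 2:1 weight ratio.
combined = ensemble_scores(
    {"location": 2.0, "investigate": 1.0},
    {"location": 0.5, "tool": 1.0},
    scorp_weight=2.0,
    csp_weight=1.0,
)
ranking = sorted(combined, key=combined.get, reverse=True)
```

In practice the two submodels may produce scores on different scales, so the ratio absorbs both their relative reliability and their relative magnitude.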
We choose HowNet as the sememe KB, which annotates 103,843 Chinese words and 103,390 English words. These words contain 212,539 language-independent senses. Following previous works xie2017lexical; jin2018incorporating, we disregard the hierarchical structures of sememes in sememe prediction and filter out the low-frequency sememes which appears less than times in HowNet. The final number of sememes we use is . In addition, we concatenate all definitions of a word and use the concatenated definition for training and evaluation.
We evaluate our models on Chinese and English. For Chinese, we use the Modern Chinese Dictionary (6th Edition; http://www.cp.com.cn/book/978-7-100-08467-3_44.html) as the source of dictionary definitions. It defines about 69,000 Chinese words and characters. We extract dictionary definitions and segment them into words using THULAC li2009punctuation. The final dataset contains 48,383 words, each of which has sememe annotation, a word embedding, and a dictionary definition. Word embeddings are trained using Skip-gram mikolov2013distributed on the Sogou-T corpus, a corpus of web pages provided by a Chinese commercial search engine (https://www.sogou.com/labs/resource/t.php). The dataset is randomly divided into non-overlapping training, validation, and test sets in the ratio of 8:1:1.
For English, we use WordNet miller1995wordnet as the source of dictionary definitions, and use the pre-trained GloVe pennington2014glove word embeddings. The dataset has 43,907 words and is also divided in the ratio of 8:1:1. Table 1 shows some statistics of the two datasets.
Table 1: Statistics of the two datasets.

| Statistic | Chinese | English |
|---|---|---|
| #words in dictionary | 69,000 | 147,306 |
| #senses in dictionary | 98,732 | 206,978 |
| #words in HowNet | 103,843 | 103,390 |
| #words with embeddings | 5,606,977 | 400,000 |
| #words in dataset | 48,383 | 43,907 |
| avg. sememes of words | 2.57 | 3.00 |
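A random, non-overlapping 8:1:1 split of the kind described for both datasets can be sketched as follows. The seed, the rounding rule (integer division, with the remainder going to the test set), and the function name are assumptions; the text does not state how the split was implemented.

```python
import random

def split_811(words, seed=0):
    """Shuffle words and cut them into train/validation/test at 80% and 90%."""
    items = list(words)
    random.Random(seed).shuffle(items)
    n = len(items)
    first_cut, second_cut = n * 8 // 10, n * 9 // 10
    return items[:first_cut], items[first_cut:second_cut], items[second_cut:]

# Integer ids stand in for the 48,383 words of the Chinese dataset.
train, valid, test = split_811(range(48383))
sizes = (len(train), len(valid), len(test))
```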
4.2 Experimental Setup
We choose the following 4 models as baselines: (1) SPWE xie2017lexical, an embedding-based model, which first finds some similar words for the target word in the embedding space and then recommends the sememes of these similar words; (2) CSP jin2018incorporating, an ensemble model utilizing both internal character information, which is employed by its submodels SPWCF and SPCSE, and external word information, which is employed by its submodels SPWE and SPSE xie2017lexical; (3) LD+Seq2Seq li2018sememe, which utilizes dictionary definitions by a reformed sequence-to-sequence model; (4) MC, the basic sequence-to-set multi-label classification model in Section 3.2.
Hyper-parameters and Training
For all the models, the dimension of word embeddings is . For SCorP and MC models, the dimension of BiLSTM hidden states is . For all the other baseline methods, their hyper-parameters except word embedding dimension are respectively tuned to the best on the validation set. For our ensemble model, the weight ratio of correspondence scores is , which is also determined by tuning on the validation dataset. As for training, we use Adam Optimizer kingma2014adam and the learning rate is . During the training process, the word embeddings are frozen. Dropout is employed in SCorP and MC.
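Following previous works, we use mean average precision (MAP) and F1 score as evaluation metrics. For a word $w$ with $n$ sememes, the model scores all sememes and ranks them by score. If the rankings of its $n$ sememes are $r_1 < r_2 < \dots < r_n$, the average precision of $w$ is
$$\mathrm{AP}(w) = \frac{1}{n} \sum_{k=1}^{n} \frac{k}{r_k},$$
and MAP is the mean of $\mathrm{AP}(w)$ over all evaluated words.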
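As a concrete reading of the MAP metric, the sketch below computes average precision for one word from its ranked sememe list and gold sememe set, then averages over words. It uses the standard definition of average precision; the function names and the example ranking are illustrative.

```python
def average_precision(ranked, gold):
    """Average of precision@rank taken at the rank of each gold sememe."""
    hits, total = 0, 0.0
    for rank, sememe in enumerate(ranked, start=1):
        if sememe in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0

def mean_average_precision(examples):
    """examples: iterable of (ranked sememe list, gold sememe set) pairs."""
    examples = list(examples)
    return sum(average_precision(r, g) for r, g in examples) / len(examples)

# Gold sememes ranked 2nd and 4th: AP = (1/2 + 2/4) / 2.
ap = average_precision(
    ["celestial body", "investigate", "knowledge", "location"],
    {"investigate", "location"},
)
map_score = mean_average_precision([
    (["a", "b", "c"], {"a", "b"}),   # perfect ranking, AP = 1.0
    (["celestial body", "investigate", "knowledge", "location"],
     {"investigate", "location"}),   # AP = 0.5
])
```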
For F1 score, we set a threshold $\delta$ for all models except LD+Seq2Seq. All sememes with scores higher than $\delta$ compose the output set. $\delta$ is a hyper-parameter and is also tuned on the validation set. We conduct 10-fold cross-validation in experiments.
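The thresholded output set and its F1 score can be sketched for a single word as follows. The scores are the final pooled scores reported for “observatory” in the case study; the two threshold values and the function name are hypothetical, and how per-word scores are aggregated over the test set is not specified here.

```python
def f1_at_threshold(scores, gold, threshold):
    """F1 of the set of sememes whose score is strictly above threshold."""
    predicted = {s for s, a in scores.items() if a > threshold}
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

scores = {"celestial body": 6.16, "investigate": 2.37, "location": 1.97,
          "celestial event": 1.68, "knowledge": 0.56, "far": -1.28}
gold = {"celestial body", "investigate", "location", "celestial event"}

f1_loose = f1_at_threshold(scores, gold, threshold=0.0)   # also admits "knowledge"
f1_tight = f1_at_threshold(scores, gold, threshold=1.0)   # exactly the gold set
```

Tuning the threshold on the validation set then amounts to evaluating this function over a grid of candidate values and keeping the best one.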
4.3 Overall Sememe Prediction Performance
Table 2 lists the sememe prediction performance of our models and baseline methods. From this table, we can see that:
(1) Our proposed SCorP(+TW,SE) model achieves the best sememe prediction performance compared with all the single models and previous ensemble model CSP. In addition, our proposed ensemble model achieves state-of-the-art performance.
(2) Our SCorP model and its variants always perform substantially better than the MC model and its corresponding variants. This suggests that a fixed-size definition vector is not enough to contain all the information of the dictionary definition, and that it is necessary to consider all the hidden states of the definition. Moreover, the max-pooling operation can effectively capture the local semantic correspondence. To validate this, we compare max-pooling with mean-pooling and an attention mechanism in SCorP. All three operations can collect information from the hidden states of the BiLSTM. Mean-pooling averages the correspondence scores for every sememe. In the model with the attention mechanism, for every sememe, we use the sememe vector to attend over all the hidden states of the BiLSTM, and the output of the attention is used to compute the association score of the sememe by a fully connected layer. After replacing max-pooling with mean-pooling and the attention mechanism, the MAP drops from 54.95 to 51.76 and 52.23, respectively. The reason is that a sememe of the target word usually corresponds semantically to only a few definition words and has nothing to do with most of the others. In the models with mean-pooling or the attention mechanism, the association score of a sememe is inevitably affected by the hidden states of all the definition words, and the hidden states of irrelevant words introduce a lot of noise.
(3) In general, both +SE and +TW enhance the sememe prediction performance of the MC and SCorP models alike. +TW yields the larger gain, because word embeddings are powerful, and previous works relying only on target word embeddings have already achieved acceptable results.
(4) The LD+Seq2Seq model has relatively poor performance. This suggests that imposing an inappropriate order on sememes is harmful. Furthermore, using character sequences as input is another possible reason for the poor performance, because the ambiguity of characters may hurt the precise representations of definitions.
4.4 Sememe Prediction for OOV Words
Models that rely heavily on target word embeddings either do not work or suffer serious performance deterioration when faced with out-of-vocabulary (OOV) words. Our SCorP model can cope with OOV words because it needs no target word embeddings. The word splitting workaround of our +TW operation can also mitigate the OOV problem.
To test the ability to handle the OOV problem of existing models, we expand the original dataset by adding OOV words which have no embeddings. Then we evaluate our models and baseline methods on the expanded dataset. Notice that for CSP, only two submodels SPWCF and SPCSE can predict sememes for OOV words, and other two submodels are not used.
The evaluation results are shown in Table 3. From the table, we can observe that our plain SCorP model and its variant SCorP(+TW) dramatically outperform the two baseline methods, especially for the OOV words. This demonstrates our model's strong ability to generalize to OOV words. We also confirm the effectiveness of the word splitting workaround by an ablation experiment. After removing word splitting for target words in +TW (denoted by +TW-WS), the SCorP(+TW,-WS) model performs considerably worse in sememe prediction for the OOV words, and its performance is even close to that of the plain SCorP model, which makes no use of target word embeddings.
4.5 Influence of Word Frequency
Table 4 lists the evaluation results of 4 models for words with different frequencies. SCorP(+SE,TW) produces the best performance in all the word frequency ranges, which demonstrates its robustness. CSP, which relies heavily on word embeddings, suffers a sharp performance drop when faced with low-frequency words (word frequency less than 50), although it performs acceptably on high-frequency words. LD+Seq2Seq is less affected by word frequency, but it lags far behind the other models in every word frequency range.
4.6 Case Study
Table 5: The six most corresponding sememes and their correspondence scores for each definition word of “observatory”.

| Definition Words | Predicted Sememes |
|---|---|
| 天文台 (observatory) | celestial body/0.20, tell/-2.60, celestial event/-3.26, images/-4.63, light/-4.65, knowledge/-5.00 |
| ： | celestial body/-4.37, tell/-5.66, time/-5.74, part/-6.38, morning/-7.08, past/-7.17 |
| 观测 (observe) | investigate/2.37, celestial event/-0.20, far/-1.28, celestial body/-1.35, measurement/-2.19, look/-2.68 |
| 天体 (celestial body) | celestial body/6.16, celestial event/1.68, investigate/0.63, measurement/-2.12, far/-3.40, look/-3.51 |
| 和 (and) | celestial body/-1.95, investigate/-2.66, celestial event/-3.61, find/-4.26, look/-5.32, choose/-5.86 |
| 研究 (research) | investigate/1.78, celestial event/-0.88, celestial body/-1.99, research/-2.27, find/-2.62, knowledge/-3.46 |
| 天文学 (astronomy) | celestial body/2.50, investigate/1.54, celestial event/1.29, knowledge/0.56, daytime/-1.31, earch/-1.85 |
| 的 (of) | part/-5.56, time/-6.31, human/-6.36, place/-7.04, tell/-7.16, head/-7.74 |
| 机构 (institution) | location/1.97, celestial event/-1.32, knowledge/-1.48, celestial body/-2.02, facility/-2.14, tool/-2.90 |
| 。 | time/-1.66, celestial body/-3.02, part/-3.21, investigate/-4.90, celestial event/-4.95, daytime/-5.46 |
| Predicted Sememes | celestial body/6.16, investigate/2.37, location/1.97, celestial event/1.68, knowledge/0.56, far/-1.28 |
In this section, we conduct a case study on a sememe prediction example, to illustrate the effectiveness of our SCorP model in capturing local semantic correspondence.
We again pick the word “observatory”, which has four sememes, namely “location”, “investigate”, “celestial event” and “celestial body”. Our SCorP model obtains the correspondence matrix measuring the semantic correspondence between each sememe and each definition word. Table 5 shows the six most corresponding sememes and their correspondence scores for each definition word. If a sememe and its correspondence score in a certain row are in boldface, the sememe achieves its highest correspondence score on the definition word in this row. For example, in the row of the definition word “institution”, the sememe “location” is in boldface, which means “location” corresponds with “institution” most closely. In addition, the sememes “investigate” and “celestial body, celestial event” correspond most closely with the words “observe” and “celestial body”, respectively. The semantic correspondence scores between punctuation marks and all sememes are relatively low. These results demonstrate that our SCorP model can properly capture the local semantic correspondence. The last row lists the predicted sememes with the highest correspondence scores after max-pooling. We can clearly see that the four correct sememes obtain the highest scores and occupy exactly the top 4 ranks.
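The last row of Table 5 can be checked from the rows above it. The sketch below applies max-pooling to the listed scores and ranks the sememes by pooled score. Only the six highest-scoring sememes per definition word are listed, so each row is partial; this does not change the top six, because every unlisted score is below the lowest listed score in its row (at most -1.85), which is lower than the sixth pooled score. Row keys use the English glosses from the table, sememe names are copied as printed, and the function name is illustrative.

```python
TABLE5 = {
    "observatory": {"celestial body": 0.20, "tell": -2.60, "celestial event": -3.26,
                    "images": -4.63, "light": -4.65, "knowledge": -5.00},
    ":": {"celestial body": -4.37, "tell": -5.66, "time": -5.74,
          "part": -6.38, "morning": -7.08, "past": -7.17},
    "observe": {"investigate": 2.37, "celestial event": -0.20, "far": -1.28,
                "celestial body": -1.35, "measurement": -2.19, "look": -2.68},
    "celestial body": {"celestial body": 6.16, "celestial event": 1.68,
                       "investigate": 0.63, "measurement": -2.12,
                       "far": -3.40, "look": -3.51},
    "and": {"celestial body": -1.95, "investigate": -2.66, "celestial event": -3.61,
            "find": -4.26, "look": -5.32, "choose": -5.86},
    "research": {"investigate": 1.78, "celestial event": -0.88,
                 "celestial body": -1.99, "research": -2.27,
                 "find": -2.62, "knowledge": -3.46},
    "astronomy": {"celestial body": 2.50, "investigate": 1.54,
                  "celestial event": 1.29, "knowledge": 0.56,
                  "daytime": -1.31, "earch": -1.85},
    "of": {"part": -5.56, "time": -6.31, "human": -6.36,
           "place": -7.04, "tell": -7.16, "head": -7.74},
    "institution": {"location": 1.97, "celestial event": -1.32, "knowledge": -1.48,
                    "celestial body": -2.02, "facility": -2.14, "tool": -2.90},
    "。": {"time": -1.66, "celestial body": -3.02, "part": -3.21,
          "investigate": -4.90, "celestial event": -4.95, "daytime": -5.46},
}

def max_pool(rows):
    """Keep, for every sememe, its highest score over all definition words."""
    pooled = {}
    for scores in rows.values():
        for sememe, score in scores.items():
            if sememe not in pooled or score > pooled[sememe]:
                pooled[sememe] = score
    return pooled

pooled = max_pool(TABLE5)
top6 = sorted(pooled.items(), key=lambda item: item[1], reverse=True)[:6]
```

The resulting ranking is celestial body, investigate, location, celestial event, knowledge, far, with the same scores as the last row of the table.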
5 Conclusion and Future Work
In this paper, we discover the local semantic correspondence between a word's sememes and the words in its dictionary definition. Accordingly, we propose a novel model that exploits this property to predict sememes. We also design two useful retrofitting operations: incorporating sememes of definition words and integrating embeddings of target words. Experimental results show that our model with the two retrofitting operations achieves state-of-the-art performance.
We will explore the following research directions in the future: (1) taking into account the hierarchical structures of sememes and allocating context-dependent weights when incorporating sememes of definition words; (2) predicting the hierarchical structure of sememes as well, which is a key step for expanding and updating sememe KBs like HowNet, and for which dictionary definitions are suitable because different types of definition words correspond to sememes in different hierarchical layers; (3) transferring our model to cross-lingual sememe prediction to assist in building sememe KBs for other languages.