Source code of the paper 'Integrating Semantic Knowledge to Tackle Zero-shot Text Classification. NAACL-HLT 2019. '
Insufficient or even unavailable training data of emerging classes is a big challenge of many classification tasks, including text classification. Recognising text documents of classes that have never been seen in the learning stage, so-called zero-shot text classification, is therefore difficult and only limited previous works tackled this problem. In this paper, we propose a two-phase framework together with data augmentation and feature augmentation to solve this problem. Four kinds of semantic knowledge (word embeddings, class descriptions, class hierarchy, and a general knowledge graph) are incorporated into the proposed framework to deal with instances of unseen classes effectively. Experimental results show that each and the combination of the two phases achieve the best overall accuracy compared with baselines and recent approaches in classifying real-world texts under the zero-shot scenario.
As one of the most fundamental problems in machine learning, automatic classification has been widely studied in several domains. However, many approaches, proven to be effective in traditional classification tasks, cannot catch up with a dynamic and open environment where new classes can emerge after the learning stage Romera-Paredes and Torr (2015). For example, the number of topics on social media is growing rapidly, and the classification models are required to recognise the text of the new topics using only general information (e.g., descriptions of the topics) since labelled training instances are infeasible to obtain for each new topic Lee et al. (2011). This scenario holds in many real-world domains such as object recognition and medical diagnosis Xian et al. (2017); World Health Organization (1996).
Zero-shot learning (ZSL) for text classification aims to classify documents of classes which are absent from the learning stage. Although it is challenging for a machine to achieve, humans are able to learn new concepts by transferring knowledge from known to unknown domains based on high-level descriptions and semantic representations Thrun and Pratt (1998). Therefore, without labelled data of unseen classes, a zero-shot learning framework is expected to exploit supportive semantic knowledge (e.g., class descriptions, relations among classes, and external domain knowledge) to generally infer the features of unseen classes using patterns learned from seen classes.
So far, three main types of semantic knowledge have been employed in general zero-shot scenarios Fu et al. (2018). The most widely used one is semantic attributes of classes such as visual concepts (e.g., colours, shapes) and semantic properties (e.g., behaviours, functions) Lampert et al. (2009); Zhao et al. (2018). The second type is concept ontology, including class hierarchy and knowledge graphs, which represents relationships among classes and features Wang et al. (2018); Fergus et al. (2010). The third type is semantic word embeddings which capture implicit relationships between words thanks to a large training text corpus Socher et al. (2013); Norouzi et al. (2013). Nonetheless, concerning ZSL in text classification particularly, there are few studies exploiting one of these knowledge types and none has considered the combinations of them Pushp and Srivastava (2017); Dauphin et al. (2013). Moreover, some previous works used different datasets to train and test, but there is similarity between classes in the training and testing set. For example, in Dauphin et al. (2013), the class “imdb.com” in the training set naturally corresponds to the class “Movies” in the testing set. Hence, these methods are not working under a strict zero-shot scenario.
To tackle the zero-shot text classification problem, this paper proposes a novel two-phase framework together with data augmentation and feature augmentation (Figure 1). In addition, four kinds of semantic knowledge including word embeddings, class descriptions, class hierarchy, and a general knowledge graph (ConceptNet) are exploited in the framework to effectively learn the unseen classes. Both of the two phases are based on convolutional neural networks Kim (2014). The first phase, called coarse-grained classification, judges if a document is from seen or unseen classes. Then, the second phase, named fine-grained classification, finally decides its class. Note that all the classifiers in this framework are trained using labelled data of seen classes (and augmented text data) only. None of the steps learns from the labelled data of unseen classes.
The contributions of our work can be summarised as follows.
We propose a novel deep learning based two-phase framework, including coarse-grained and fine-grained classification, to tackle the zero-shot text classification problem. Unlike some previous works, our framework does not require semantic correspondence between classes in a training stage and classes in an inference stage. In other words, the seen and unseen classes can be clearly different.
We propose a novel data augmentation technique called topic translation to strengthen the capability of our framework to detect documents from unseen classes effectively.
We propose a method to perform feature augmentation by using integrated semantic knowledge to transfer the knowledge learned from seen to unseen classes in the zero-shot scenario.
In the remainder of this paper, we firstly explain our proposed zero-shot text classification framework in section 2. Experiments and results, which demonstrate the performance of our framework, are presented in section 3. Related works are discussed in section 4. Finally, section 5 concludes our work and mentions possible future work.
Let $\mathcal{C}_S$ and $\mathcal{C}_U$ be disjoint sets of seen and unseen classes of the classification respectively. In the learning stage, a training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ is given where $x_i$ is the $i$-th document containing a sequence of words $[w_1^i, w_2^i, \ldots, w_t^i]$ and $y_i \in \mathcal{C}_S$ is the class of $x_i$. In the inference stage, the goal is to predict the class of each document, $\hat{y}_i$, in a testing set which has the same data format as the training set except that $y_i$ comes from $\mathcal{C}_S \cup \mathcal{C}_U$. Note that (i) every class comes with a class label and a class description (Figure 2a); (ii) a class hierarchy showing superclass-subclass relationships is also provided (Figure 2b); (iii) the documents from unseen classes cannot be observed to train the framework.
As discussed in the Introduction, our proposed classification framework consists of two phases (Figure 1). The first phase, coarse-grained classification, predicts whether an input document comes from seen or unseen classes. We also apply a data augmentation technique in this phase to help the classifiers be aware of the existence of unseen classes without accessing their real data. Then the second phase, fine-grained classification, finally specifies the class of the input document. It uses either a traditional classifier or a zero-shot classifier depending on the coarse-grained prediction given by Phase 1. Also, feature augmentation based on semantic knowledge is used to provide additional information which relates the document and the unseen classes to generalise the zero-shot reasoning.
We use the following notations in Figure 1 and throughout this paper.
The list of embeddings of each word in the document $x_i$ is denoted by $v_w^i = [v_{w_1}^i, v_{w_2}^i, \ldots, v_{w_t}^i]$.
The embedding of each class label $c$ is denoted by $v_c$, $\forall c \in \mathcal{C}_S \cup \mathcal{C}_U$. It is assumed that each class has a one-word class label. If the class label has more than one word, a similar one-word class label is provided to find $v_c$.
Given a document $x_i$, Phase 1 performs a binary classification to decide whether $\hat{y}_i \in \mathcal{C}_S$ or $\hat{y}_i \notin \mathcal{C}_S$. In this phase, each seen class $c_s \in \mathcal{C}_S$ has its own CNN classifier (with a subsequent dense layer and a sigmoid output) to predict the confidence that $x_i$ comes from the class $c_s$, i.e., $p(\hat{y}_i = c_s \mid x_i)$. The classifier uses $v_w^i$ as an input and it is trained using a binary cross entropy loss with all documents of its class in the training set as positive examples and the rest as negative examples.
For a test document $x_i$, this phase computes $p(\hat{y}_i = c_s \mid x_i)$ for every seen class $c_s$ in $\mathcal{C}_S$. If there exists a class $c_s$ such that $p(\hat{y}_i = c_s \mid x_i) > \tau_s$, it predicts $\hat{y}_i \in \mathcal{C}_S$; otherwise, $\hat{y}_i \notin \mathcal{C}_S$. $\tau_s$ is a classification threshold for the class $c_s$, calculated based on the threshold adaptation method from Shu et al. (2017).
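The Phase 1 decision rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the confidence values, and the thresholds are made up for the example, and the threshold adaptation of Shu et al. (2017) is not reproduced here.

```python
def coarse_grained_predict(confidences, thresholds):
    """Return True if the document is predicted to come from a seen class.

    confidences: dict mapping each seen class to the sigmoid confidence
                 produced by that class's own classifier
    thresholds:  dict mapping each seen class to its per-class threshold
                 (hypothetical values; how they are adapted is not shown)
    """
    # The document is "seen" as soon as one classifier exceeds its threshold.
    return any(confidences[c] > thresholds[c] for c in confidences)


# Illustrative numbers only.
thresholds = {"Artist": 0.60, "Building": 0.85}
seen_case = coarse_grained_predict({"Artist": 0.42, "Building": 0.91}, thresholds)
unseen_case = coarse_grained_predict({"Artist": 0.42, "Building": 0.80}, thresholds)
```

In the first call the "Building" classifier exceeds its threshold, so the document would be routed to the traditional classifier; in the second call no classifier does, so it would be routed to the zero-shot classifier.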
During the learning stage, the classifiers in Phase 1 use negative examples solely from seen classes, so they may not be able to differentiate the positive class from unseen classes. Hence, when the names of unseen classes are known in the inference stage, we try to introduce them to the classifiers in Phase 1 via augmented data so they can learn to reject the instances likely from unseen classes. We do data augmentation by translating a document from its original seen class to a new unseen class using analogy. We call this process topic translation.
At the word level, we translate a word $w$ in a document of class $c$ to a corresponding word $w'$ in the context of a target class $c'$ by solving an analogy question “$c$:$w$ :: $c'$:?”. For example, solving the analogy “company:firm :: village:?” via word embeddings Mikolov et al. (2013), we know that the word “firm” in a document of class “company” can be translated into the word “hamlet” in the context of class “village”. Our framework adopts the 3CosMul method by Levy and Goldberg (2014) to solve the analogy question and find candidates of $w'$:
$$w' = \operatorname*{arg\,max}_{x \in V} \frac{\cos(x, c')\,\cos(x, w)}{\cos(x, c) + \epsilon}$$
where $V$ is a vocabulary set and $\cos(a, b)$ is a cosine similarity score between the vectors of word $a$ and word $b$. Also, $\epsilon$ is a small number (i.e., 0.001) added to prevent division by zero.
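To make the word-level analogy concrete, the following is a minimal pure-Python sketch of a 3CosMul-style ranking. It assumes the score is the product of the candidate's cosine similarities to the target class and to the source word, divided by its similarity to the source class plus a small constant. The three-dimensional vectors are invented for the example (real embeddings would come from a pretrained model), and raw cosines are used, which is safe here only because all toy vectors are non-negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def three_cos_mul(emb, source_class, word, target_class, eps=0.001, topn=1):
    """Rank candidates for the analogy source_class : word :: target_class : ?"""
    scores = {}
    for cand, vec in emb.items():
        if cand in (source_class, word, target_class):
            continue  # the query words themselves are not valid answers
        numerator = cosine(vec, emb[target_class]) * cosine(vec, emb[word])
        scores[cand] = numerator / (cosine(vec, emb[source_class]) + eps)
    return sorted(scores, key=scores.get, reverse=True)[:topn]


# Toy embeddings (assumed values, for illustration only).
toy_embeddings = {
    "company": [1.0, 0.1, 0.2],
    "firm":    [0.9, 0.1, 0.6],
    "village": [0.1, 1.0, 0.2],
    "hamlet":  [0.1, 0.9, 0.6],
    "town":    [0.1, 1.0, 0.0],
    "bank":    [0.8, 0.2, 0.1],
}

best = three_cos_mul(toy_embeddings, "company", "firm", "village")
ranking = three_cos_mul(toy_embeddings, "company", "firm", "village", topn=3)
```

With pretrained vectors loaded in gensim, a comparable query would be `KeyedVectors.most_similar_cosmul(positive=["firm", "village"], negative=["company"])`; note that gensim's implementation may differ in details such as how cosines are shifted and which constant is added.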
At the document level, we follow Algorithm 1 to translate a document of class $c$ into the topic of another class $c'$. To explain, we translate all nouns, verbs, adjectives, and adverbs in the given document to the target class, word-by-word, using the word-level analogy. The word to replace must have the same part of speech as the original word and all the replacements in one document are 1-to-1 relations, enforced by replace_dict in Algorithm 1. With this idea, we can create augmented documents for the unseen classes by topic-translation from the documents of seen classes in the training dataset. After that, we can use the augmented documents as additional negative examples for all the CNNs in Phase 1 to make them aware of the tone of unseen classes.
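The document-level procedure can be sketched as follows. This is not Algorithm 1 itself, which is not reproduced in this text, but a hedged approximation of the description above: `pos_of` and `candidates_for` are hypothetical callables standing in for a part-of-speech tagger and the word-level analogy solver, and keeping the original word when no valid candidate exists is an assumption of this sketch.

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # only these are translated


def translate_document(tokens, pos_of, candidates_for):
    """Translate a tokenised document word by word into a target topic.

    tokens:         list of words of the source document
    pos_of:         callable word -> coarse part-of-speech tag (hypothetical)
    candidates_for: callable word -> ranked analogy candidates (hypothetical)
    """
    replace_dict = {}     # source word -> chosen replacement
    used_targets = set()  # keeps replacements 1-to-1 within the document
    output = []
    for word in tokens:
        if pos_of(word) not in CONTENT_POS:
            output.append(word)  # other words are copied unchanged
            continue
        if word not in replace_dict:
            for cand in candidates_for(word):
                if cand not in used_targets and pos_of(cand) == pos_of(word):
                    replace_dict[word] = cand
                    used_targets.add(cand)
                    break
            else:
                replace_dict[word] = word  # assumption: fall back to the original
        output.append(replace_dict[word])
    return output


# Toy tagger and candidate lists (invented for the example).
toy_pos = {"the": "DET", "a": "DET", "firm": "NOUN", "manager": "NOUN",
           "hamlet": "NOUN", "elder": "NOUN", "hired": "VERB", "elected": "VERB"}
toy_candidates = {"firm": ["hamlet"],
                  "hired": ["hamlet", "elected"],
                  "manager": ["hamlet", "elder"]}

translated = translate_document(
    ["the", "firm", "hired", "a", "manager", "the", "firm"],
    toy_pos.get,
    lambda w: toy_candidates.get(w, []),
)
```

In this toy run, "hamlet" is rejected for "hired" (wrong part of speech and already used) and for "manager" (already used), so each source word maps to a distinct replacement and repeated occurrences of "firm" are translated consistently.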
Phase 2 decides the most appropriate class $\hat{y}_i$ for $x_i$ using two CNN classifiers: a traditional classifier and a zero-shot classifier as shown in Figure 1. If $\hat{y}_i \in \mathcal{C}_S$ is predicted by Phase 1, the traditional classifier will finally select a class $c_s \in \mathcal{C}_S$ as $\hat{y}_i$. Otherwise, if $\hat{y}_i \notin \mathcal{C}_S$, the zero-shot classifier will be used to select a class $c_u \in \mathcal{C}_U$ as $\hat{y}_i$.
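The routing between the two phases can be sketched as below. All four callables are toy stand-ins invented for the example, and selecting the unseen class with the highest zero-shot confidence is an assumption of this sketch rather than a detail stated above.

```python
def two_phase_predict(doc, phase1_is_seen, traditional_classifier,
                      zero_shot_confidence, unseen_classes):
    """Route a document through Phase 1, then the matching Phase 2 classifier."""
    if phase1_is_seen(doc):
        # Seen branch: a multi-class classifier over the seen classes.
        return traditional_classifier(doc)
    # Unseen branch: score every unseen class and keep the most confident one
    # (argmax is an assumption of this sketch).
    return max(unseen_classes, key=lambda c: zero_shot_confidence(doc, c))


# Toy stand-ins (hypothetical behaviour, for illustration only).
def toy_phase1(doc):
    return "film" in doc


def toy_traditional(doc):
    return "Film"


def toy_zero_shot(doc, cls):
    keyword = {"Plant": "leaf", "Athlete": "race"}[cls]
    return doc.count(keyword)


unseen = ["Plant", "Athlete"]
pred_unseen = two_phase_predict(["green", "leaf"], toy_phase1,
                                toy_traditional, toy_zero_shot, unseen)
pred_seen = two_phase_predict(["a", "film"], toy_phase1,
                              toy_traditional, toy_zero_shot, unseen)
```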
The traditional classifier and the zero-shot classifier have an identical CNN-based structure followed by two dense layers but their inputs and outputs are different. The traditional classifier is a multi-class classifier ($|\mathcal{C}_S|$ classes) with a softmax output, so it requires only the word embeddings $v_w^i$ as an input. This classifier is trained using a cross entropy loss with a training dataset whose examples are from seen classes only.
In contrast, the zero-shot classifier is a binary classifier with a sigmoid output. Specifically, it takes a text document $x_i$ and a class $c$ as inputs and predicts the confidence $p(\hat{y}_i = c \mid x_i)$. However, in practice, we utilise $v_w^i$ to represent $x_i$, $v_c$ to represent the class $c$, and also augmented features $v_{w,c}^i$ to provide more information on how intimate the connections between words and the class $c$ are. Altogether, for each word $w_j^i$, the classifier receives the concatenation of three vectors (i.e., $[v_{w_j}^i; v_c; v_{w_j,c}^i]$) as an input. This classifier is trained using a binary cross entropy loss with training data from seen classes only, but we expect this classifier to work well on unseen classes thanks to the distinctive patterns of $v_{w,c}^i$ in positive examples of every class. This is how we transfer knowledge from seen to unseen classes in ZSL.
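The per-word input of the zero-shot classifier can be illustrated with a small sketch. The function name and the tiny dimensions are assumptions chosen for readability; the actual embeddings and augmented features are much larger.

```python
def build_zero_shot_input(word_vectors, class_vector, relationship_vectors):
    """Build one row per word: word embedding + class embedding + relationship vector."""
    if len(word_vectors) != len(relationship_vectors):
        raise ValueError("need exactly one relationship vector per word")
    # The same class embedding is repeated on every row.
    return [list(wv) + list(class_vector) + list(rv)
            for wv, rv in zip(word_vectors, relationship_vectors)]


# Toy document of three words with 2-dim embeddings and 3-dim relationship vectors.
word_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
class_vector = [0.9, 0.8]
relationship_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 0.0]]

rows = build_zero_shot_input(word_vectors, class_vector, relationship_vectors)
```

Each row has the width of the word embedding plus the class embedding plus the relationship vector, so the resulting matrix can be fed to a convolutional layer in the same way as a plain sequence of word embeddings.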
The relationship vector $v_{w_j,c}$ contains augmented features we input to the zero-shot classifier. $v_{w_j,c}$ shows how the word $w_j$ and the class $c$ are related considering the relations in a general knowledge graph. In this work, we use ConceptNet providing general knowledge of natural language words and phrases Speer and Havasi (2013). A subgraph of ConceptNet is shown in Figure 2c as an illustration. Nodes in ConceptNet are words or phrases, while edges connecting two nodes show how they are related either syntactically or semantically.
We firstly represent a class $c$ as three sets of nodes in ConceptNet by processing the class hierarchy, class label, and class description of $c$. (1) the_class_nodes is a set of nodes of the class label $c$ and any tokens inside $c$ if $c$ has more than one word. (2) superclass_nodes is a set of nodes of all the superclasses of $c$ according to the class hierarchy. (3) description_nodes is a set of nodes of all nouns in the description of the class $c$. For example, if $c$ is the class “Educational Institution”, according to Figure 2a-2b, the three sets of ConceptNet nodes for this class are:
(1) educational_institution, educational, institution
(2) organization, agent
(3) place, people, ages, education.
To construct $v_{w_j,c}$, we consider whether the word $w_j$ is connected to the members of the three sets above within $K$ hops by particular types of relations or not (in this paper, we only consider the most common types of positive relations, which are RelatedTo, IsA, PartOf, and AtLocation; they cover 60% of all edges in ConceptNet). For each of the three sets, we construct a vector $v_h$ with $3K+1$ dimensions.
$v_h[0] = 1$ if $w_j$ is a node in that set; otherwise, $v_h[0] = 0$.
For $k = 0, \ldots, K-1$: $v_h[3k+1] = 1$ if there is a node in the set whose shortest path to $w_j$ is $k+1$. Otherwise, $v_h[3k+1] = 0$.
$v_h[3k+2]$ is the number of nodes in the set whose shortest path to $w_j$ is $k+1$.
$v_h[3k+3]$ is $v_h[3k+2]$ divided by the total number of nodes in the set.
Thus, the vector associated to each set shows how $w_j$ is semantically close to that set. Finally, we concatenate the constructed vectors from the three sets to become $v_{w_j,c}$ with $9K+3$ dimensions.
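The construction of the relationship vector can be sketched with a breadth-first search over a toy graph. This sketch makes several assumptions that should be read as such: the graph is a small undirected adjacency dictionary invented for the example (not real ConceptNet data), relation types are ignored, and each set contributes one membership flag followed by three features per hop (a flag, a count, and a normalised count).

```python
from collections import deque


def hop_distances(graph, start, max_hops):
    """Shortest hop distance from `start` to every node within `max_hops`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def set_vector(word, dist, node_set, max_hops):
    """Features describing how close `word` is to one set of class nodes."""
    vec = [1.0 if word in node_set else 0.0]  # membership flag
    for k in range(1, max_hops + 1):
        count = sum(1 for n in node_set if dist.get(n) == k)
        vec += [1.0 if count else 0.0, float(count), count / len(node_set)]
    return vec


def relationship_vector(graph, word, class_node_sets, max_hops=3):
    """Concatenate the per-set vectors for all sets representing a class."""
    dist = hop_distances(graph, word, max_hops)
    vec = []
    for node_set in class_node_sets:
        vec.extend(set_vector(word, dist, node_set, max_hops))
    return vec


# Toy undirected graph (assumed edges, for illustration only).
toy_graph = {
    "school": ["education", "institution"],
    "education": ["school", "people"],
    "institution": ["school", "organization"],
    "people": ["education"],
    "organization": ["institution", "agent"],
    "agent": ["organization"],
}
class_sets = [
    {"educational_institution", "educational", "institution"},  # the_class_nodes
    {"organization", "agent"},                                   # superclass_nodes
    {"place", "people", "ages", "education"},                    # description_nodes
]

vector = relationship_vector(toy_graph, "school", class_sets, max_hops=3)
```

Under these assumptions each set yields ten values for a limit of three hops, so the concatenated vector for the word "school" has thirty entries.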
We used two textual datasets for the experiments. The vocabulary size of each dataset was limited to the 20,000 most frequent words and all numbers were excluded. (1) DBpedia ontology dataset Zhang et al. (2015) includes 14 non-overlapping classes and textual data collected from Wikipedia. Each class has 40,000 training and 5,000 testing samples. (2) The 20newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) has 20 topics, each of which has approximately 1,000 documents. 70% of the documents of each class were randomly selected for training, and the remaining 30% were used as a testing set.
In our experiments, two different rates of unseen classes, 50% and 25%, were chosen and the corresponding sizes of $\mathcal{C}_S$ and $\mathcal{C}_U$ are shown in Table 1. For each dataset and each unseen rate, the random selection of ($\mathcal{C}_S$, $\mathcal{C}_U$) was repeated ten times and these ten groups were used by all the experiments with this setting for a fair comparison. All documents from $\mathcal{C}_U$ were removed from the training set accordingly. Finally, the results from all the ten groups were averaged.
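The repeated random selection of seen and unseen classes can be sketched as follows. The function name, the fixed seed, and the placeholder class names are assumptions of this example; the number of unseen classes is passed explicitly because the exact sizes are those reported in Table 1.

```python
import random


def sample_class_splits(classes, n_unseen, n_groups=10, seed=0):
    """Draw `n_groups` random (seen, unseen) partitions of the class set."""
    rng = random.Random(seed)  # fixed seed so every experiment reuses the same groups
    ordered = sorted(classes)
    splits = []
    for _ in range(n_groups):
        unseen = set(rng.sample(ordered, n_unseen))
        seen = set(ordered) - unseen
        splits.append((seen, unseen))
    return splits


# Placeholder class names standing in for a 14-class dataset.
classes = [f"class_{i}" for i in range(14)]
splits = sample_class_splits(classes, n_unseen=7)
```

Documents whose class falls in the unseen part of a split would then be dropped from the training set for that group, and the reported score would be the average over the ten groups.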
In Phase 1, the structure of each classifier was identical. The CNN layer had three filter sizes [3, 4, 5] with 400 filters for each filter size and the subsequent dense layer had 300 units. For data augmentation, we used gensim with an implementation of 3CosMul Řehůřek and Sojka (2010) to solve the word-level analogy (line 5 in Algorithm 1). Also, the numbers of augmented text documents per unseen class for every setting (if used) are indicated in Table 1. These numbers were set empirically considering the number of available training documents to be translated.
In Phase 2, the traditional classifier and the zero-shot classifier had the same structure, in which the CNN layer had three filter sizes [2, 4, 8] with 600 filters for each filter size and the two intermediate dense layers had 400 and 100 units respectively. For feature augmentation, the maximum path length in ConceptNet was set to 3 to create the relationship vectors (based on our observation, most of the related words stay within 3 hops from the class nodes in ConceptNet). The DBpedia ontology (http://mappings.dbpedia.org/server/ontology/classes/) was used to construct a class hierarchy of the DBpedia dataset. The class hierarchy of the 20newsgroups dataset was constructed based on the namespaces initially provided by the dataset. Meanwhile, the class descriptions of both datasets were picked from Macmillan Dictionary (https://www.macmillandictionary.com/) as appropriate.
For both phases, we used 200-dim GloVe vectors (glove6B.zip in https://nlp.stanford.edu/projects/glove/) for word embeddings $v_w$ and $v_c$ Pennington et al. (2014). All the deep neural networks were implemented with TensorLayer Dong et al. (2017a) and TensorFlow Abadi et al. (2016).
|Dataset||Unseen rate||#Augmented docs per|
We compared each phase and the overall framework with the following approaches and settings.
Phase 1: Proposed by Shu et al. (2017), DOC is a state-of-the-art open-world text classification approach which classifies a new sample into a seen class or “reject” if the sample does not belong to any seen classes. DOC uses a single CNN and a 1-vs-rest sigmoid output layer with threshold adjustment. Unlike DOC, the classifiers in the proposed Phase 1 work individually. However, for a fair comparison, we used DOC only as a binary classifier in this phase ($\hat{y}_i \in \mathcal{C}_S$ or $\hat{y}_i \notin \mathcal{C}_S$).
Phase 2: To see how well the augmented features work in ZSL, we ran the zero-shot classifier with different combinations of inputs. Particularly, five combinations of $v_w$, $v_c$, and $v_{w,c}$ were tested with documents from unseen classes only (traditional ZSL).
The whole framework: (1) Count-based model selected the class whose label appears most frequently in the document as $\hat{y}_i$. (2) Label similarity Sappadla et al. (2016) is an unsupervised approach which calculates the cosine similarity between the sum of word embeddings of each class label and the sum of word embeddings of every n-gram in the document. We adopted this approach to do single-label classification by predicting the class that got the highest similarity score among all classes. (3) RNN AutoEncoder was built based on a Seq2Seq model with LSTM (512 hidden units), and it was trained to encode documents and class labels onto the same latent space. The cosine similarity was applied to select a class label closest to the document on the latent space. (4) RNN+FC refers to the architecture 2 proposed in Pushp and Srivastava (2017). It used an RNN layer with LSTM (512 hidden units) followed by two dense layers with 400 and 100 units respectively. (5) CNN+FC replaced the RNN in the previous model with a CNN, which has the identical structure as the zero-shot classifier in Phase 2. Both RNN+FC and CNN+FC predicted the confidence $p(\hat{y}_i = c \mid x_i)$ given $v_w$ and $v_c$. The class with the highest confidence was selected as $\hat{y}_i$.
For Phase 1, we used the accuracy for binary classification ($\hat{y}_i \in \mathcal{C}_S$ or $\hat{y}_i \notin \mathcal{C}_S$) as an evaluation metric. In contrast, for Phase 2 and the whole framework, we used the multi-class classification accuracy ($\hat{y}_i = y_i$) as a metric.
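The two evaluation metrics can be written down directly. The sketch below uses invented labels and predictions purely to show the computation; the function names are not taken from the original code.

```python
def phase1_accuracy(true_labels, predicted_seen, seen_classes):
    """Fraction of documents whose seen/unseen decision is correct."""
    correct = sum((y in seen_classes) == p
                  for y, p in zip(true_labels, predicted_seen))
    return correct / len(true_labels)


def multiclass_accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted class equals the true class."""
    correct = sum(y == y_hat for y, y_hat in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy example: "Film" and "Artist" are seen, "Plant" and "Athlete" are unseen.
true_labels = ["Film", "Plant", "Artist", "Athlete"]
seen_classes = {"Film", "Artist"}

binary_acc = phase1_accuracy(true_labels, [True, False, False, False], seen_classes)
overall_acc = multiclass_accuracy(true_labels, ["Film", "Plant", "Building", "Plant"])
```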
|Dataset||Unseen rate||Count-based||Label Similarity Sappadla et al. (2016)||RNN Autoencoder||RNN + FC Pushp and Srivastava (2017)||CNN + FC||Ours|
The evaluation of Phase 1 (coarse-grained classification) checks if each $x_i$ was correctly delivered to the right classifier in Phase 2. Table 3 shows the performance of Phase 1 with and without augmented data compared with DOC. Considering test documents from seen classes only, our framework outperformed DOC on both datasets. In addition, the augmented data improved the accuracy of detecting documents from unseen classes clearly and led to higher overall accuracy in every setting. Despite no real labelled data from unseen classes, the augmented data generated by topic translation helped Phase 1 better detect documents from unseen classes. Table 4 shows some examples of augmented data from the DBpedia dataset. Even if they are not completely understandable, they contain the tone of the target classes.
|Dataset Unseen rate||DOC||Ours w/o aug.||Ours w/ aug.|
|Animal (Original)||Mitra perdulca is a species of sea snail a marine gastropod mollusk in the family Mitridae the miters or miter snails.|
|Animal → Plant||Arecaceae perdulca is a flowering of port aster a naval mollusk gastropod in the fabaceae Clusiaceae the tiliaceae or rockery amaryllis.|
|Animal → Athlete||Mira perdulca is a swimmer of sailing sprinter an Olympian limpets gastropod in the basketball Middy the miters or miter skater.|
Although Phase 1 provided confidence scores for all seen classes, we could not use them to predict $\hat{y}_i$ directly since the distributions of scores of positive examples from different CNNs are different. Figure 3 shows that the distribution of confidence scores of the class “Artist” had a noticeably larger variance and was clearly different from the class “Building”. Hence, even if $p(\hat{y}_i = \text{“Building”} \mid x_i) > p(\hat{y}_i = \text{“Artist”} \mid x_i)$, we cannot conclude that $x_i$ is more likely to come from the class “Building”. This is why a traditional classifier in Phase 2 is necessary.
Regarding Phase 2, fine-grained classification is in charge of predicting $\hat{y}_i$ and it employs two classifiers which were tested separately. Assuming Phase 1 is perfect, the classifiers in Phase 2 should be able to find the right class. The purpose of Table 5 is to show that the traditional CNN classifier in Phase 2 was highly accurate.
Besides, given test documents from unseen classes only, the performance of the zero-shot classifier in Phase 2 is shown in Table 6. Based on the construction method, $v_{w,c}$ quantified the relatedness between words and the class but, unlike $v_w$ and $v_c$, it did not include detailed semantic meaning. Thus, the classifier using only $v_{w,c}$ could not find out the correct unseen class, and neither could $[v_w; v_{w,c}]$ nor $[v_c; v_{w,c}]$. On the other hand, the combination of $[v_w; v_c]$, which included semantic embeddings of both words and the class label, increased the accuracy of predicting unseen classes clearly. However, the zero-shot classifier fed by the combination of all three types of inputs $[v_w; v_c; v_{w,c}]$ achieved the highest accuracy in all settings. It asserts that the integration of semantic knowledge we proposed is an effective means for knowledge transfer from seen to unseen classes in the zero-shot scenario.
|Input Unseen rate||50%||25%||50%||25%|
|Inputs Unseen rate||50%||25%||50%||25%|
Last but most importantly, we compared the whole framework with the baselines as shown in Table 2. First, the count-based model is a rule-based model so it failed to predict documents from seen classes accurately and led to poor overall results. This was similar to the label similarity approach even though it had a higher degree of flexibility. Next, the RNN Autoencoder was trained without any supervision since $\hat{y}_i$ was predicted based on the cosine similarity. We believe the implicit semantic relatedness between classes caused the failure of the RNN Autoencoder. Besides, the CNN+FC and RNN+FC had the same inputs and outputs and it was clear that CNN+FC performed better than RNN+FC in the experiment. However, neither CNN+FC nor RNN+FC was able to transfer the knowledge learned from seen to unseen classes. Finally, our two-phase framework has competitive prediction accuracy on unseen classes while maintaining the accuracy on seen classes. This made it achieve the highest overall accuracy on both datasets and both unseen rates. In conclusion, by using integrated semantic knowledge, the proposed two-phase framework with data and feature augmentation is a promising step to tackle this challenging zero-shot problem.
Furthermore, another benefit of the framework is high flexibility. As the modules in Figure 1 are only loosely coupled to one another, each of them can be improved or customised independently. For example, we can deploy an advanced language understanding model, e.g., BERT Devlin et al. (2018), as a traditional classifier. Moreover, we may replace ConceptNet with a domain-specific knowledge graph to deal with medical texts.
There are a few more related works to discuss besides recent approaches we compared with in the experiments (explained in section 3.3). Dauphin et al. (2013) predicted semantic utterance of texts by mapping class labels and text samples into the same semantic space and classifying each sample to the closest class label. Nam et al. (2016) learned the embeddings of classes, documents, and words jointly in the learning stage. Hence, it can perform well in domain-specific classification, but this is possible only with a large amount of training data. Overall, most of the previous works exploited semantic relationships between classes and documents via embeddings. In contrast, our proposed framework leverages not only the word embeddings but also other semantic knowledge. While word embeddings are used to solve analogy for data augmentation in Phase 1, the other semantic knowledge sources (in Figure 2) are integrated into relationship vectors and used as augmented features in Phase 2. Furthermore, our framework does not require any semantic correspondences between seen and unseen classes.
In the face of insufficient data, data augmentation has been widely used to improve generalisation of deep neural networks especially in computer vision Krizhevsky et al. (2012) and multimodality Dong et al. (2017b), but it is still not a common practice in natural language processing. Recent works have explored data augmentation in NLP tasks such as machine translation and text classification Saito et al. (2017); Fadaee et al. (2017); Kobayashi (2018), and the algorithms were designed to preserve semantic meaning of an original document by using synonyms Zhang and LeCun (2015) or adding noises Xie et al. (2017), for example. In contrast, our proposed data augmentation technique translates a document from one meaning (its original class) to another meaning (an unseen class) by analogy in order to substitute unavailable labelled data of the unseen class.
Apart from improving classification accuracy, feature augmentation is also used in domain adaptation to transfer knowledge between a source and a target domain Pan et al. (2010b); Fang and Chiang (2018); Chen et al. (2018). An early research paper applying feature augmentation in NLP is Daume III (2007) which targeted domain adaptation on sequence labelling tasks. After that, feature augmentation was used in several NLP tasks such as cross-domain sentiment classification Pan et al. (2010a), multi-domain machine translation Clark et al. (2012), semantic argument classification Batubara et al. (2018), etc. Our work differs from previous works not only in that we applied this technique to zero-shot text classification but also in that we integrated many types of semantic knowledge to create the augmented features.
To tackle zero-shot text classification, we proposed a novel CNN-based two-phase framework together with data augmentation and feature augmentation. The experiments show that data augmentation by topic translation improved the accuracy in detecting instances from unseen classes, while feature augmentation enabled knowledge transfer from seen to unseen classes for zero-shot learning. Thanks to the framework and the integrated semantic knowledge, our work achieved the highest overall accuracy compared with all the baselines and recent approaches in all settings. In the future, we plan to extend our framework to do multi-label classification with a larger amount of data, and also study how semantic units defined by linguists can be used in the zero-shot scenario.
We would like to thank Douglas McIlwraith, Nontawat Charoenphakdee, and three anonymous reviewers for helpful suggestions. Jingqing and Piyawat would also like to thank the support from the LexisNexis® Risk Solutions HPCC Systems® academic program and Anandamahidol Foundation, respectively.
Thirtieth AAAI Conference on Artificial Intelligence.
A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.