Text classification is the process of determining and assigning topical categories to text. It plays an important role in many web applications, such as contextual advertising [1], topical web search [2], and web search personalization [3]. Usually, text classification requires a sufficiently large taxonomy of topical categories to capture the various topics in arbitrary texts. In addition, it is necessary to collect a large amount of training data for each category in the taxonomy.
One line of work adopts an implicit representation model, such as a deep neural network, which uses dense semantic encodings and measures semantic similarity accordingly. Implicit representation models have been successfully adopted for text classification tasks. Such models, however, may perform poorly in large-scale text classification (as we shall show in Section 5.4). This is largely because the training data for each category are relatively insufficient and unevenly distributed among the classification categories. In addition, such approaches are not intuitively interpretable to humans.
In another line of work, many studies have used an explicit representation model, which relies on popular knowledge bases such as ProBase, Wikipedia, or the Open Directory Project (ODP; http://www.curlie.org). Because the explicit model represents knowledge in terms of vectors that are interpretable to both humans and machines, it is relatively easy for humans to tune and understand it. Another advantage of the explicit representation model is that it enables large-scale text classification, because a large-scale knowledge taxonomy is already built into the representation.
To handle large-scale text classification, several works [1, 9, 10] have utilized the ODP, which is a large-scale, taxonomy-structured web directory. These studies represent ODP knowledge with an explicit representation of text, based on bag-of-words [1, 10] or bag-of-phrases [9], to develop ODP-based text classification techniques. They showed that ODP-based text classification techniques are effective at large-scale text classification. The performance of previous ODP-based text classification, however, is bounded by the knowledge contained in the ODP and/or Wikipedia knowledge bases.
To alleviate this limitation, we incorporate word embeddings into ODP-based text classification. To this end, we propose two novel joint models of ODP-based classification and word2vec, a representative word embedding technique. The joint models seek to project both words and ODP categories into the same vector space. As a result, the category vectors of ODP categories can identify words learned from external knowledge. In addition, we effectively measure the semantic relatedness between an ODP category and a document by utilizing both category and word vectors. In summary, our contributions are three-fold:
- We propose a novel methodology to handle large-scale text classification, which utilizes both the explicit and implicit representations.
- We develop two novel joint models of ODP-based classification and word2vec to generate category vectors that represent the semantics of ODP categories. In addition, we develop a new semantic similarity measure that utilizes both the category and word vectors.
- We demonstrate the efficacy of the proposed methodology through extensive experiments on real-world datasets. The performance evaluation shows that our approach significantly outperforms the state-of-the-art techniques in terms of macro-averaged F1-score and precision at k.
The remainder of this paper is organized as follows. We briefly describe the ODP-based knowledge representation and word2vec in Section 2. Section 3 describes the joint models of ODP-based classification and word2vec to generate category vectors. Section 4 details the similarity measure between a category and document. We present the performance evaluation results in Section 5. We discuss related research and conclude this work in Sections 6 and 7, respectively.
2.1 ODP-based Knowledge Representation
We employ the ODP-based text classification scheme as our explicit representation model. To compute the centroid $\vec{c}_i$ of category $c_i$, we calculate the averaged term vector of all ODP documents as:

$$\vec{c}_i = \frac{1}{|D_i|} \sum_{d \in D_i} \vec{d} \qquad (1)$$

where $D_i$ is the set of ODP documents in $c_i$, and $\vec{d}$ is the term vector of document $d$, weighted by tf-idf values. Due to the large-scale taxonomy structure of the ODP, however, each ODP category contains a different number of documents, which sometimes results in sparse or even unavailable training documents in a category. This issue is addressed in [1, 10], which merge the centroids of the descendant categories to build a classifier. As a result, this approach outperforms all other ODP-based text classifiers and exhibits stable performance in large-scale text classification [1, 10]. Therefore, we utilize this approach to compute the centroid of category $c_i$.
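As a rough illustration of the centroid computation described above, the following Python sketch averages sparse tf-idf term vectors for a single category. The function name `centroid` and the toy weights are assumptions made for illustration only, and the merging of descendant centroids used in [1, 10] is not shown.

```python
from collections import defaultdict


def centroid(term_vectors):
    """Return the average of sparse tf-idf vectors (dicts mapping term -> weight)."""
    if not term_vectors:
        raise ValueError("a category needs at least one document")
    sums = defaultdict(float)
    for vec in term_vectors:
        for term, weight in vec.items():
            sums[term] += weight
    n = len(term_vectors)
    # A term absent from a document counts as weight 0 for that document.
    return {term: total / n for term, total in sums.items()}


# Two toy documents of one category (weights are made up).
docs = [
    {"president": 0.8, "election": 0.4},
    {"president": 0.6, "congress": 0.2},
]
c = centroid(docs)
```

Terms that occur in only some documents are still divided by the total number of documents, which matches averaging full-length term vectors with zeros in the missing positions.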
| term vector of $d$ | 0.67 | 0 | 0.51 | 0 | … |
| centroid vector of $c_i$ | 0.10 | 0.44 | 0.05 | 0.31 | … |
For example, as shown in Table 1, the category $c_i$, Society/Government/President, is explicitly represented by its centroid vector. Given a document $d$, “Trump became prez”, however, the ODP-based classification may fail to assign the document to category $c_i$. This is because this approach cannot capture the semantic relations between words (e.g., prez and president).
To complement the ODP-based classification, we adopt word2vec [5, 6], a popular word embedding technique. In word2vec, each word vector is trained using a shallow neural network language model, such as continuous bag-of-words (CBOW) or Skip-gram. Skip-gram aims to predict context words given a target word in a sliding window. Mathematically, given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of Skip-gram is to maximize the following average log probability:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\; j \ne 0} \log p(w_{t+j} \mid w_t) \qquad (2)$$

where $m$ is the size of the context window centered at the target word, and $w_t$ and $w_{t+j}$ are the target and context words, respectively.
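To make the role of the sliding window concrete, the sketch below enumerates the (target, context) pairs whose log probabilities the Skip-gram objective sums over. The helper name `skipgram_pairs` and the `window` argument (the number of words considered on each side of the target) are illustrative assumptions; actual training would additionally need a model for the conditional probability of a context word given the target.

```python
def skipgram_pairs(words, window):
    """Return all (target, context) pairs within `window` positions of each target."""
    pairs = []
    for t, target in enumerate(words):
        lo = max(0, t - window)
        hi = min(len(words), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # the target word is not its own context
                pairs.append((target, words[j]))
    return pairs


sentence = ["us", "president", "trump", "urged", "congress"]
pairs = skipgram_pairs(sentence, window=1)
```

With `window=1`, each interior word contributes two pairs and each boundary word contributes one, so the window is simply truncated at the ends of the sequence.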
Trained word vectors with similar semantic meanings are located close to each other in the vector space. For example, the word vectors of president and prez would be located close to each other, whereas the word vectors of president and casino would be located much farther apart in the embedding space. In addition, word vectors can be composed by an element-wise addition of their vector representations, e.g., Russian + river = Volga River. This property of the vectors is called “additive compositionality”. Due to the simple structure of word2vec, many previous studies have proposed variants of the word2vec model that go beyond the word level to achieve document-, topic-, or concept-level representations [7, 11].
3 Joint Models of Explicit and Implicit Representation
In this section, we describe two joint models of ODP-based text classification and word2vec. These joint models generate category vectors, which represent the semantics of ODP categories. Each category vector not only semantically encodes the explicitly expressed ODP category, but also understands semantically related words that do not appear in the ODP knowledge base. This is because they are projected into the same semantic space as word vectors learned in an additional volume of knowledge outside the ODP.
3.1 Generating Category Vector with Algebraic Operation
Given the centroid vector of an ODP category and the word vectors of the pre-trained word2vec model, our first approach generates the category vector using scalar multiplication and vector addition, as follows.
First, we multiply the word vector of each word in the ODP category by that word's term weight. Second, the weighted word vectors are composed into a category vector using element-wise addition. This type of vector algebra is quite simple, but it can clearly represent the semantics of an ODP category. This is because the word vectors are not only multiplied by a precisely trained term weight from the centroid vector, but also have additive compositionality.
The category vector of the ODP category $c_i$ is generated as follows:

$$\vec{v}_{c_i} = \sum_{w \in W_i} t_{i,w}\, \vec{v}_w \qquad (3)$$

where $\vec{v}_{c_i}$ is the category vector of $c_i$, $W_i$ is the set of words of $c_i$, $\vec{v}_w$ is the word vector (obtained from the pre-trained word2vec model) of word $w$, and $t_{i,w}$ is the term weight of $w$ in the centroid vector $\vec{c}_i$. For example, in Figure 1(a), the word vectors of the three example words are multiplied by 0.44, 0.31, and 0.10, respectively, and the weighted word vectors are then added. Finally, we obtain the category vector of the category Society/Government/President. Vector representations of documents to be classified are generated in the same manner.
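The weighted-sum construction described above can be sketched in a few lines of Python. The function name `category_vector`, the placeholder words, and the two-dimensional toy embeddings are assumptions for illustration; only the weights 0.44, 0.31, and 0.10 are taken from the example in the text. Skipping words that have no embedding is likewise an assumption, since the text does not say how such words are handled.

```python
def category_vector(term_weights, word_vectors):
    """Weighted element-wise sum of word vectors for one category (or document)."""
    dim = len(next(iter(word_vectors.values())))
    result = [0.0] * dim
    for word, weight in term_weights.items():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumption: words without an embedding contribute nothing
        for k in range(dim):
            result[k] += weight * vec[k]
    return result


# Hypothetical words and 2-dimensional embeddings; real vectors would be 300-dimensional.
word_vectors = {"word_a": [1.0, 0.0], "word_b": [0.0, 1.0], "word_c": [1.0, 1.0]}
centroid_weights = {"word_a": 0.44, "word_b": 0.31, "word_c": 0.10}
cv = category_vector(centroid_weights, word_vectors)
```

Because documents are represented in the same manner, the same function can be applied to a document's term vector to obtain a document vector in the same space.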
3.2 Generating Category Vector with Embedding
Our second approach extends word2vec to represent category vectors, instead of using the pre-trained word2vec model to compose word vectors in ODP categories. We first assign appropriate ODP categories for each word in a text corpus. Then, we train the category vectors of the assigned ODP categories by applying a modified Skip-gram model. The category vector of an ODP category is expected to represent the collective semantics of words under this category.
The process of generating category vectors with embedding is as follows. First, we identify candidate ODP categories for the target word. If an ODP category is strongly associated with the target word, the ODP-based text classification selects this category as a candidate. The ODP-based text classification determines the degree of association by using the term weight of the target word in each ODP category. For example, for a given target word, the ODP-based classification identifies categories such as Game/Gambling and Society/Government/President, as shown in Figure 1(b). We then select the most appropriate ODP category in the current context by using the ODP-based text classification. For example, when the context is “US President Trump urged congress”, the most appropriate category is Society/Government/President. Finally, we apply the modified Skip-gram algorithm, which trains the category vector corresponding to the most appropriate category.
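The objective of category embedding is to maximize the following average log probability:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\; j \ne 0} \left[ \log p(w_{t+j} \mid w_t) + \log p(w_{t+j} \mid c_t) \right] \qquad (4)$$

where $w_1, \ldots, w_T$ is the sequence of training words, $m$ is the size of the context window, and $c_t$ is the ODP category assigned to the target word $w_t$. Unlike the Skip-gram model, where the target word is used only to predict context words, the category embedding model also uses the ODP category of the target word to predict context words.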
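The candidate identification and context-based selection steps of this procedure might look as follows in Python. This is only a sketch under several assumptions: the function names, the `min_weight` cut-off for "strongly associated", the unit-weight bag-of-words context vector, and all centroid weights are hypothetical, and plain cosine similarity stands in for the full ODP-based classifier.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts mapping term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def assign_category(target, context, centroids, min_weight=0.1):
    """Pick the candidate category of `target` that best matches the context."""
    # Step 1: candidates are categories in which the target word has a large term weight.
    candidates = [c for c, vec in centroids.items() if vec.get(target, 0.0) >= min_weight]
    if not candidates:
        return None
    # Step 2: classify the context (here a unit-weight bag of words) among the candidates.
    context_vec = {w: 1.0 for w in context}
    return max(candidates, key=lambda c: cosine(centroids[c], context_vec))


# Toy centroids with made-up weights.
centroids = {
    "Game/Gambling": {"trump": 0.3, "casino": 0.6, "card": 0.4},
    "Society/Government/President": {"trump": 0.3, "president": 0.5, "congress": 0.4},
    "Home/Cooking": {"recipe": 0.7},
}
context = ["us", "president", "trump", "urged", "congress"]
best = assign_category("trump", context, centroids)
```

The category returned for each target word is the one whose vector would then be updated by the modified Skip-gram step; that training step is not shown here.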
4 Semantic Similarity Measure
We develop a novel semantic similarity measure, on the basis of category and word vectors, which captures both the semantic relations between words and the semantics of ODP categories.
4.1 Using Word-level Semantics
First, we propose a semantic similarity measure that considers word-level semantics by using only the word vectors. The word vectors can be used to calculate the semantic relatedness between two words. The key idea of this measure is to align words with similar meanings in a category and document, although the words represented in this category and document are different.
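Before describing the proposed measure, we explain how the existing ODP-based text classification computes the similarity between category $c_i$ and document $d$:

$$\cos(\vec{c}_i, \vec{d}) = \frac{1}{\|\vec{c}_i\|\,\|\vec{d}\|} \sum_{p=1}^{n_c} \sum_{q=1}^{n_d} \delta(a_p - b_q)\; t_{i,a_p}\, t_{d,b_q} \qquad (5)$$

where $a_p$ and $b_q$ are the non-zero terms in the centroid vector $\vec{c}_i$ of $c_i$ and the term vector $\vec{d}$, respectively, with term weights $t_{i,a_p}$ and $t_{d,b_q}$, while $n_c$ and $n_d$ are the numbers of non-zero terms in $\vec{c}_i$ and $\vec{d}$, respectively. $\delta(\cdot)$ is the Dirac function defined by $\delta(0) = 1$ and $\delta(x) = 0$ for $x \neq 0$, so that $\delta(a_p - b_q) = 1$ only when $a_p$ and $b_q$ are the same term.
The cosine similarity between the centroid vector of category $c_i$ and the term vector of document $d$ increases only when $a_p$ and $b_q$ are equal. However, in Table 1, we observe that prez has a very similar meaning to president, which is a very important word in the category Society/Government/President. Therefore, we propose a new measure that also increases the similarity for semantically close $a_p$ and $b_q$ by utilizing word2vec. By substituting the Dirac function with the word similarity function $\phi$, it is possible to consider the semantic relatedness between two words and calculate the weight more densely:

$$\mathrm{sim}_{w}(c_i, d) = \frac{1}{\|\vec{c}_i\|\,\|\vec{d}\|} \sum_{p=1}^{n_c} \sum_{q=1}^{n_d} \phi(a_p, b_q)\; t_{i,a_p}\, t_{d,b_q} \qquad (6)$$

where $\phi$ is the word similarity function. Given two words $a$ and $b$, we define the word similarity function in Eq. (6) as follows:

$$\phi(a, b) = \begin{cases} \cos(\vec{v}_a, \vec{v}_b) & \text{if } \cos(\vec{v}_a, \vec{v}_b) \ge \theta \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $\vec{v}_a$ and $\vec{v}_b$ are the word vectors of $a$ and $b$, $\cos(\vec{v}_a, \vec{v}_b)$ is the cosine similarity between $\vec{v}_a$ and $\vec{v}_b$, and $\theta$ is a threshold, which is empirically set to 0.6 in our analysis. The similarity between $c_i$ and $d$ increases not only when $a$ and $b$ are equal, but also when they have similar semantics. For example, prez and president have highly similar semantics in Table 1. The semantic similarity using word-level semantics thus additionally accumulates the term $0.51 \times 0.44 \times \phi(\text{prez}, \text{president})$, unlike the original cosine similarity.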
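A minimal Python sketch of this word-level measure is given below, assuming sparse term vectors stored as dictionaries and a small table of toy word embeddings. The function names, the three-dimensional vectors, and the handling of out-of-vocabulary words (identical words score 1, otherwise 0) are assumptions for illustration; only the threshold of 0.6 and the weights 0.67, 0.51, 0.44, and 0.31 come from the text.

```python
import math


def cos_dense(u, v):
    """Cosine similarity of two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def word_sim(a, b, word_vectors, threshold=0.6):
    """Thresholded word similarity: cosine if it reaches the threshold, else 0."""
    if a == b:
        return 1.0  # identical terms always match, as with the Dirac function
    if a not in word_vectors or b not in word_vectors:
        return 0.0  # assumption: unknown words only match themselves
    s = cos_dense(word_vectors[a], word_vectors[b])
    return s if s >= threshold else 0.0


def semantic_similarity(centroid, doc, word_vectors, threshold=0.6):
    """Cosine-style similarity in which every term pair is weighted by word_sim."""
    total = sum(
        cw * dw * word_sim(a, b, word_vectors, threshold)
        for a, cw in centroid.items()
        for b, dw in doc.items()
    )
    norm_c = math.sqrt(sum(w * w for w in centroid.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    return total / (norm_c * norm_d) if norm_c and norm_d else 0.0


# Toy embeddings chosen so that only "prez" and "president" are close.
word_vectors = {
    "president": [1.0, 0.0, 0.0],
    "prez": [0.9, 0.1, 0.0],
    "trump": [0.0, 1.0, 0.0],
    "government": [0.0, 0.0, 1.0],
}
centroid = {"president": 0.44, "government": 0.31}
doc = {"trump": 0.67, "prez": 0.51}
sim = semantic_similarity(centroid, doc, word_vectors)
```

In this toy setting the document and the centroid share no term, so the plain cosine similarity would be zero, while the thresholded word similarity lets the pair (president, prez) contribute.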
4.2 Using Category- and Word-level Semantics
In this paper, we develop a robust similarity measure by utilizing both the category and word vectors. A category vector is utilized as a pseudo word in the process of computing semantic similarity. A new measure can be expressed as follows:
In Eq. (8), the category vector is inserted into the corresponding category as the word. This is motivated by the fact that category vectors share the same semantic space with word vectors. Similarly, the document vector is inserted into the corresponding document as the word. We will examine how to insert a category vector as a pseudo word by determining the weight (i.e., pseudo term weight) of the category vector through many parameter experiments in Section 5.4.
[Table 2: statistics of the training and test datasets]
5.1.1 Training Datasets
We use the RDF dump of the original ODP dataset released on January 8, 2017, which contains 802,379 categories and 3,624,444 webpages. To obtain a well-organized ODP taxonomy, we apply previously suggested heuristic rules and build our own taxonomy with 2,735 categories. The final training dataset used in our experiments thus consists of 52,046 webpages. To construct the moderate-scale classification dataset, we use only 13 top-level categories from the ODP taxonomy, excluding two categories, Top/News and Top/Adult, which contain fewer than 100 webpages. The training dataset used in the moderate-scale classification thus consists of 51,856 webpages.
In addition to the ODP dataset, we train our category embedding model and word2vec model on the “One Billion Word Language Modeling Benchmark” dataset released by Google (https://code.google.com/archive/p/word2vec/). The word and category vectors are 300-dimensional, and the window size is set to 5 with 15 negative samples.
5.1.2 Test Datasets
We build two test datasets, ODP and NYT, to evaluate our methodology. The ODP test dataset consists of webpages collected from the original ODP. The webpages in each category are randomly divided into a training set and a test set at a ratio of seven to three. In particular, we build two kinds of ODP test datasets. For the large-scale classification task, we collect 24,121 webpages from the 2,735 ODP categories in our taxonomy, and for the moderate-scale classification task, we collect 24,046 webpages from 13 ODP categories. In addition to the ODP test datasets, we select six categories of the New York Times (art, business, food, health, politics, and sports) as the source for our second test dataset. We randomly collect 20 news articles from each of these categories. Table 2 shows the statistics of the datasets.
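The per-category seven-to-three split can be sketched as follows. The function name `split_per_category`, the fixed random seed, and the rounding rule for categories whose size is not a multiple of ten are assumptions; the text only states that the webpages in each category are divided randomly at that ratio.

```python
import random


def split_per_category(pages_by_category, train_ratio=0.7, seed=0):
    """Randomly split the pages of every category into training and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for category, pages in pages_by_category.items():
        shuffled = list(pages)
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * train_ratio))
        train[category] = shuffled[:cut]
        test[category] = shuffled[cut:]
    return train, test


# Toy data: ten hypothetical page identifiers in each of two categories.
pages = {
    "Sports/Tennis": [f"tennis-{i}" for i in range(10)],
    "Home/Cooking": [f"cooking-{i}" for i in range(10)],
}
train, test = split_per_category(pages)
```

Splitting within each category, rather than over the pooled collection, keeps every category represented on both sides whenever it has enough pages.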
5.2 Evaluation Metrics
For the ODP test dataset, we use the macro-averaged precision, recall, and F1-score as the classification performance metrics. We adopt macro-averaging, which assigns equal weight to each category rather than to each test document, because the distribution of the ODP training dataset is highly skewed [1, 10]. For the NYT test dataset, we use precision at k. Three participants manually assess the top-k ODP categories returned by the text classifiers on a three-level scale: relevant, somewhat relevant, and not relevant.
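For concreteness, the two kinds of metrics can be computed as in the sketch below. It assumes single-label predictions, takes macro F1 as the mean of the per-category F1 values (one common convention), and treats the manual judgments as binary; how "somewhat relevant" judgments are counted is not specified in the text, so that mapping is left to the caller. The function names are illustrative.

```python
def macro_prf(gold, predicted):
    """Macro-averaged precision, recall, and F1 over the categories in `gold`."""
    categories = sorted(set(gold))
    p_sum = r_sum = f_sum = 0.0
    for c in categories:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        p_sum += prec
        r_sum += rec
        f_sum += f1
    n = len(categories)
    # Every category counts equally, regardless of how many documents it has.
    return p_sum / n, r_sum / n, f_sum / n


def precision_at_k(relevance, k):
    """Fraction of the top-k ranked categories judged relevant (booleans)."""
    return sum(1 for r in relevance[:k] if r) / k


gold = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "b"]
p, r, f = macro_prf(gold, pred)
p_at_3 = precision_at_k([True, False, True, True], 3)
```

In the toy example, category "b" has a single test document but contributes as much to the averages as category "a", which is the property that motivates macro-averaging on skewed data.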
5.3 Experimental Setup
The baselines include the paragraph vector-based and convolutional neural network-based text classifiers, which are state-of-the-art methods for multi-class text classification. In our experiments, we compare the following methods:
- (baseline) The ODP-based text classification only.
- (baseline) The text classification method using paragraph vectors. The learned vector representations have 1,000 dimensions. We represent each ODP category by averaging the document embeddings of the documents in the category. We use the cosine similarity to calculate the similarity between a category and a document.
- (baseline) The convolutional neural network-based text classifier. The dimension of the word embedding is 300, and the number of filters for the CNN is 900. Weights other than those of the word embedding layer are initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. We use ReLU as the nonlinearity. Optimization is performed using mini-batch SGD with a batch size of 64, accelerated by RMSProp.
- (proposed) The text classification method using category vectors, which are generated by the joint model of ODP-based text classification and word2vec. We use the cosine similarity to calculate the similarity between a category vector and a document vector.
- (proposed) The ODP-based text classification combined with the similarity measure of word-level semantics.
- (proposed) The ODP-based text classification combined with the similarity measure of both category- and word-level semantics.
5.4 Experimental Results
We first compare the two methods for generating category vectors on the ODP dataset (2,735 categories). Table 3 compares text classification utilizing the category vector generated by algebraic operations with text classification utilizing the category vector generated by embedding. Unexpectedly, we observe that the simple algebraic approach clearly outperforms the relatively elaborate embedding approach. Thus, we adopt the algebraic approach for generating category vectors in the remaining experiments.
Next, we tune the term weight of the category vector as a pseudo word. Figure 2 shows the classification performance obtained by the combined category- and word-level measure with different values of this weight. We find that the curve reaches a peak at a weight of 0.9. This result shows that the category vector plays a major role in the performance of the combined measure. However, we observe that when the weight of the category vector is 1.0, the performance drops sharply. This means that the word overlap feature is still helpful. In the remaining experiments, the weight is set to 0.9.
Table 4(a) summarizes the experimental results for text classification on the ODP test dataset with 2,735 target classes. We observe that outperforms all the other proposed methods, as well as the baselines. performs better than over 9%, 12%, and 10% on average in terms of precision, recall, and F1-score, respectively. Our experimental results show that  performs worse than . In addition, it turns out that  performs the worst among the six methods. This can be explained by the fact the distribution of webpages is skewed toward a few categories in the original ODP . Actually, we observe that 73% of ODP categories contain fewer than five webpages.
We also compare the performance of with the baseline on the ODP test dataset with 13 target categories. From Table 4(b), we observe that exhibits a better performance than in the moderate-scale text classification. From Table 4, we confirm that is indeed limited to the moderate-scale text classification.
[Table 5: precision at k on the NYT test dataset]
Table 5 shows the evaluation results on the NYT test dataset. Again, the performance of outperforms , , , and over 28%, 119%, 216%, 12%, and 10% in terms of precision at k on average, respectively. We also observe that both and outperform . These results clearly demonstrate that both category and word vectors are effective at text classification. Specifically, , which utilizes both category and word vectors, achieves the best performance in all experiments. We also perform the t-test for the classification results, and find that results are statistically significant with 0.01.
We also qualitatively examine the meaning of category vectors to analyze why adding category vectors improves the performance of ODP-based text classification. From Table 6, we observe that the category vector expresses the meaning of its category quite well. First, from the parent category Home/Cooking/Baking_and_Confections and its child category Home/Cooking/Baking_and_Confections/Breads, we observe that their category vectors share the core semantically rich words (e.g., Recipe, Baking, Cookies), while each has its own unique semantically rich words (e.g., Dessert, Bread). These observations imply that the category vector actually understands the semantics better than the centroid vector.
Interestingly, we also observe that the category vector identifies semantically related words that do not appear in the ODP knowledge base (e.g., Henin, a Belgian former professional tennis player, in the category Sports/Tennis/Players). Thus, category vectors combined with the ODP-based classification successfully enable us to improve the performance of text classification.
6 Related Work
For large-scale text classification, many approaches have been developed to handle data sparsity in a knowledge base. Data sparsity in a hierarchical taxonomy was first addressed in [14]. This work applied a statistical technique to estimate the parameters of data-sparse child categories with the help of their data-rich ancestor categories. The works in [1, 10] proposed the merge-centroid (MC) classification, which utilizes enriched training data for each category based on webpages classified into its ancestors and/or descendants in the ODP. In another line of work [9], semantic information in the ODP was enriched by incorporating another knowledge base, Wikipedia.
A simple convolutional neural network approach [8] has been proven to be an effective text classifier. Still, it exhibits limitations in large-scale text classification, as verified in our analysis. A few works [15, 16] have recently studied large-scale multi-label text classification using deep neural networks. However, they do not utilize an explicit representation model built from a knowledge base. To the best of our knowledge, our work is one of only a few that utilize both explicit and implicit knowledge representations, which enables us to perform large-scale text classification effectively.
In this paper, we have proposed novel joint models of the explicit and implicit representation techniques to handle the large-scale text classification. Specifically, we have incorporated the well-known word2vec model into the ODP-based classification framework. Our approach involves two tasks. First, we generate category vectors, which represent the semantics of ODP categories. Second, we develop a new semantic similarity measure that utilizes both category and word vectors. We have verified the large-scale classification performance of the proposed methodology using real-world datasets. The performance evaluation results confirm that our scheme significantly outperforms baseline methods. We plan to apply the proposed methodology to different applications, including contextual and mobile advertising.
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (numbers 2015R1A2A1A10052665 and 2017M3C4A7077601).
- [1] Lee, J.H., Ha, J., Jung, J.Y., Lee, S.: Semantic contextual advertising based on the open directory project. ACM Trans. on the Web 7(4) (2013) 24:1–24:22
- [2] Broder, A., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., Zhang, T.: Robust classification of rare queries using web knowledge. In: SIGIR. (2007) 231–238
- [3] Chirita, P.A., Nejdl, W., Paiu, R., Kohlschütter, C.: Using ODP metadata to personalize search. In: SIGIR. (2005) 178–185
- [4] Wang, Z., Wang, H.: Understanding short texts. In: ACL (Tutorial). (2016)
- [5] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (Workshop). (2013)
- [6] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS. (2013) 3111–3119
- [7] Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML. (2014) 1188–1196
- [8] Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP. (2014) 1746–1751
- [9] Shin, H., Lee, G., Ryu, W.J., Lee, S.: Utilizing Wikipedia knowledge in open directory project-based text classification. In: SAC. (2017) 309–314
- [10] Ha, J., Lee, J.H., Jang, W.J., Lee, Y.K., Lee, S.: Toward robust classification using the open directory project. In: DSAA. (2014) 607–612
- [11] Cheng, J., Wang, Z., Wen, J.R., Yan, J., Chen, Z.: Contextual text understanding in distributional semantic space. In: CIKM. (2015) 133–142
- [12] Song, Y., Roth, D.: Unsupervised sparse vector densification for short text similarity. In: NAACL. (2015) 1275–1280
- [13] Yang, Y.: An evaluation of statistical approaches to text categorization. Inf. Retr. 1(1) (1999) 69–90
- [14] McCallum, A., Rosenfeld, R., Mitchell, T.M., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: ICML. (1998) 359–367
- [15] Nam, J., Kim, J., Mencía, E.L., Gurevych, I., Fürnkranz, J.: Large-scale multi-label text classification — revisiting neural networks. In: ECML PKDD. (2014) 437–452
- [16] Kurata, G., Xiang, B., Zhou, B.: Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In: NAACL. (2016) 521–526