Text classification is the task of automatically assigning text to one or more categories. It plays an important role in many applications, such as contextual advertising [1, 2], web search personalization, and personalized curation systems. To capture the various topics in arbitrary texts, text classification requires a sufficiently large taxonomy of topical categories and a large amount of training data for each category.
The Open Directory Project (ODP), a large-scale web directory built by human editors, has been widely adopted for this purpose due to its ability to represent a wide variety of categories. The ODP contains approximately one million categories and millions of web pages pre-classified into those categories. However, manually constructed knowledge bases exhibit limited semantic information. In particular, the ODP suffers from a scarcity of entities, which may degrade ODP-based text classification.
An entity is a distinctly identifiable “thing”, e.g., a specific person, event, or place, and is important for understanding text [7, 8]. For example, given the text “I searched for Galaxy Nexus spec”, “Galaxy Nexus” is the most important evidence that this text is relevant to cellular-phone-related topics. However, because there is no information about the entity “Galaxy Nexus” in the ODP, the ODP-based classifier would misclassify the text into the ODP category /Science/../Galaxies.
In this paper, we propose a new approach to enrich the semantics of ODP categories with entities, which improves the effectiveness of ODP-based classification. To enrich ODP categories with entities, we leverage a knowledge base called Probase (https://concept.research.microsoft.com). Probase is a probabilistic knowledge base that contains millions of entities and concepts. One advantage of Probase over well-known knowledge bases such as WordNet (https://wordnet.princeton.edu) or Yago (http://www.yago-knowledge.org) is its high taxonomy coverage of texts (i.e., the fraction of texts for which the taxonomy contains at least one term), which enables it to understand the semantic meaning of text more thoroughly. To the best of our knowledge, this is the first work to combine the two well-known knowledge bases, the ODP and Probase, for ODP-based text classification.
Figure 1 provides an overview of our methodology. To enrich the semantics of ODP categories with Probase entities, we first represent ODP categories and Probase entities in terms of concepts (refer to 1⃝ and 2⃝ in Figure 1). Next, we measure the semantic relevance between each ODP category and each Probase entity using their concept representations (refer to 3⃝ in Figure 1). We compute the semantic relevance for all categories and re-scale these scores as a probability. Finally, based on this probability, we add the entity to the top-related ODP categories (refer to 4⃝ in Figure 1).
To illustrate our methodology, suppose there is a Probase entity “Galaxy Nexus”. We represent “Galaxy Nexus” with concepts such as “phone” or “cellular_phone”. Similarly, we represent each ODP category with concepts, e.g., the ODP category /Shopping/…/Cellular_Phone with concepts such as “phone” or “cellular_phone”. Next, we measure the semantic relevance between “Galaxy Nexus” and each ODP category, utilizing the concept representations. Finally, based on the relevance scores, we add “Galaxy Nexus” to the top-ranked ODP categories, e.g., phone-related ODP categories such as /Shopping/…/Cellular_Phone.
The technical contributions of our work are summarized as follows:
We improve the ODP-based text classification by enriching the semantics of ODP categories with Probase entities.
We propose a new approach to represent each ODP category and Probase entity as a concept representation to measure the semantic relevance between an ODP category and a Probase entity.
We demonstrate the effectiveness of our methodology by evaluating the classification performance. The results clearly show that our methodology significantly outperforms current state-of-the-art techniques.
The remainder of this paper is structured as follows. In Section II, we introduce the knowledge bases and the classification framework adopted from prior work [1, 10]. We then describe our proposed methodology in Sections III, IV, and V. We present and analyze the performance results in Section VI. We discuss related work in Section VII and conclude in Section VIII.
II-A Probase

Probase, shown in Figure 2, is an automatically constructed knowledge base built with linguistic patterns derived from billions of web pages. Probase contains millions of entities and concepts, and formalizes the relationships between entities and their enclosing concept classes along with their co-occurrence frequencies, e.g., the entity “Galaxy Nexus” and the concept “smartphone”. Probase data are publicly available; thus, many researchers have utilized this knowledge base for tasks including text conceptualization, short text classification, and taxonomy construction.
II-B Open Directory Project
The ODP, shown in Figure 3, is a large-scale web directory built by human editors. It contains approximately one million categories and millions of web pages pre-classified into those categories. The ODP categories have a hierarchical structure; there is an IS-A relationship between a parent and a child category. ODP data are publicly available; thus, the ODP has been widely applied to various research areas, such as web classification, contextual advertising, and personalized curation.
II-C ODP-based Classification
We employ the framework of prior work [1, 10] to construct the ODP-based text classifier. To train the classifier, we compute the centroid of each ODP category by averaging the term vectors of all documents in that category:

$$\vec{c}_i = \frac{1}{|D_i|} \sum_{d \in D_i} \vec{v}_d$$

where $D_i$ is the set of ODP documents in ODP category $c_i$ and $\vec{v}_d$ is the weighted term vector of document $d$, represented with tf-idf values. However, some ODP categories suffer from a data sparsity problem, since 58% of categories have fewer than five documents classified into them. Therefore, we employ the scheme of [1, 10], which merges the centroid vectors of ancestor and descendant ODP categories to build the classifier. This alleviates the data sparsity problem in the ODP.
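The centroid-based training step can be sketched as follows. This is a minimal illustration in which sparse dictionaries stand in for tf-idf term vectors; the function and variable names are ours, not from the original implementation.

```python
from collections import defaultdict

def centroid(doc_vectors):
    """Average a list of sparse term vectors (dicts mapping term -> tf-idf weight)."""
    acc = defaultdict(float)
    for vec in doc_vectors:
        for term, weight in vec.items():
            acc[term] += weight
    n = len(doc_vectors)
    return {term: total / n for term, total in acc.items()}

# Two toy documents belonging to one ODP category.
docs = [{"phone": 1.0, "android": 0.5}, {"phone": 0.5}]
category_centroid = centroid(docs)  # {"phone": 0.75, "android": 0.25}
```

A full classifier would compute one such centroid per category and assign input text to the category with the most similar centroid.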
III Representing Shared Semantics of ODP Category and Probase Entity
In this paper, we enrich the semantics of ODP categories with Probase entities, which improves the effectiveness of ODP-based classification. To do so, we add semantically related Probase entities to ODP categories by measuring the semantic relevance between them. Measuring this relevance is challenging, however, because of the structural difference between the two heterogeneous knowledge bases: the ODP consists of categories and documents, whereas Probase consists of concepts and entities. Therefore, we must find a common representation of semantic information that can be shared between them.
As a solution to this problem, we represent ODP categories and Probase entities in terms of concepts. In this paper, we define a concept as a class of entities or categories within a domain. For example, the Probase entity “Galaxy Nexus” can be described by a set of concepts such as “phone” or “cellular_phone”. Likewise, the ODP category /Shopping/../Cellular_Phone can be described by concepts such as “phone” or “cellular_phone”. Thus, if an ODP category and a Probase entity belong to similar classes, their concept representations will be similar. Based on this observation, we represent ODP categories and Probase entities as concept vectors and measure the semantic relevance between them.
IV Concept Representation of ODP Categories
In this section, we describe how to represent ODP categories with concepts (refer to 1⃝ in Figure 1). Specifically, we introduce a method for finding concepts in ODP categories and representing them as concept vectors. We then explain how to enrich the diversity of concepts in each ODP category to better represent its semantics.
IV-A Representing ODP Category with Concept Vector
We represent ODP categories using concepts. However, the ODP contains no concept information, which makes it difficult to construct concept vectors for ODP categories. Thus, we propose a methodology for finding related concepts for each ODP category. Given a set of ODP documents, we break down each ODP document into text segments. Then, we match these segments against the concepts in Probase. We regard the matched segments as candidate concepts for the ODP category.
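This segment-matching step can be sketched as follows, where the document is broken into word n-gram segments and each segment is looked up in a set of Probase concept names. The n-gram segmentation strategy and the name normalization (lowercasing, joining words with underscores) are our assumptions for illustration.

```python
def candidate_concepts(document, probase_concepts, max_len=3):
    """Break a document into word n-gram segments and keep those
    that match a concept name in Probase."""
    tokens = document.lower().split()
    found = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            segment = "_".join(tokens[i:i + n])
            if segment in probase_concepts:
                found.append(segment)
    return found

concepts = {"phone", "cellular_phone"}
print(candidate_concepts("best cellular phone deals", concepts))
# ['phone', 'cellular_phone']
```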
After searching for concepts in ODP documents, we measure the representativeness of each candidate concept in an ODP category. To measure representativeness, we apply the tf-idf scheme. Specifically, we use the term frequency (tf) because concepts that appear frequently in ODP documents are likely to be more important and descriptive for the ODP category. Likewise, we use the inverse document frequency (idf) because it discriminates the degree of semantic importance of a concept. We compute the idf from the documents in both knowledge bases, the ODP and Probase. We compute the weight $w(t, c)$ of concept $t$ in ODP category $c$ as follows:

$$w(t, c) = \log\left(1 + tf(t, c)\right) \times \log\frac{N}{df(t)}$$

where $tf(t, c)$ is the term frequency of concept $t$ in ODP category $c$, $df(t)$ is the number of source documents from the ODP and Probase in which concept $t$ appears, and $N$ is the total number of source documents. We take the logarithm of the term frequency so that concepts with low frequencies receive low weights in the ODP concept vector.
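The tf-idf weighting of candidate concepts can be sketched as below. The exact smoothing of the logarithms is an assumption; the paper only states that log-scaled term frequency and a document-frequency factor are combined.

```python
import math

def concept_weight(tf, df, n_docs):
    """tf-idf style weight for a concept in an ODP category:
    log-scaled term frequency times inverse document frequency,
    where df counts source documents (ODP and Probase) containing the concept."""
    if tf == 0 or df == 0:
        return 0.0
    return math.log(1 + tf) * math.log(n_docs / df)

# A concept seen 5 times in a category and in 10 of 1000 source documents.
w = concept_weight(tf=5, df=10, n_docs=1000)
```

Frequent concepts get higher weights, while concepts that appear in many source documents are discounted by the idf factor.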
IV-B Enriching Concept Information of ODP Category
When we represent ODP categories as concept vectors, we encounter a scarcity problem in the concept lists of ODP categories. This problem is caused by the deficiency of training data in each ODP category: in our experiments, approximately 72% of ODP categories have fewer than five documents classified into them. This deficiency of training documents leads to poor concept representations of ODP categories, which complicates the measurement of relatedness between an ODP category and a Probase entity.
As a solution to this problem, motivated by prior work, we increase the number of concepts in an ODP category using the hierarchical structure of the ODP. Specifically, we add the concepts found in the descendants of each ODP category, thereby enriching the diversity of its concept vector. Let $\hat{\vec{v}}_c$ be the enriched concept vector of ODP category $c$; we merge the concept vectors of the descendant ODP categories as follows:

$$\hat{\vec{v}}_c = \alpha \cdot \vec{v}_c + (1 - \alpha) \sum_{c' \in Child(c)} \hat{\vec{v}}_{c'}$$

where $Child(c)$ is the set of child categories of ODP category $c$, $\vec{v}_c$ is the concept vector of ODP category $c$, and $\alpha$ is the merge ratio. The enriched concept vector $\hat{\vec{v}}_c$ is a linear combination of the concept vector $\vec{v}_c$ and the sum of the enriched concept vectors of the child categories of $c$.
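The recursive merging over the ODP hierarchy can be sketched as follows, using the merge ratio of 0.7 reported in the paper's parameter setting. The dictionary-based sparse vectors and function names are our own.

```python
def enrich(category, base_vectors, children, ratio=0.7):
    """Enriched vector = ratio * own concept vector
    + (1 - ratio) * sum of the children's enriched vectors (recursive)."""
    merged = {t: ratio * w for t, w in base_vectors.get(category, {}).items()}
    for child in children.get(category, []):
        for t, w in enrich(child, base_vectors, children, ratio).items():
            merged[t] = merged.get(t, 0.0) + (1 - ratio) * w
    return merged

base = {
    "/Shopping": {"shopping": 1.0},
    "/Shopping/Cellular_Phones": {"cellular_phone": 1.0},
}
children = {"/Shopping": ["/Shopping/Cellular_Phones"]}
vec = enrich("/Shopping", base, children)
# {"shopping": 0.7, "cellular_phone": 0.21}
```

A parent category thus inherits (with damped weight) the concepts of its descendants, which counteracts the sparse concept lists of individual categories.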
V Enriching ODP Categories with Probase Entities
In this section, we describe how to enrich the semantics of ODP categories with Probase entities. Specifically, we introduce how to represent Probase entities with concepts. We then measure the semantic relevance between an ODP category and a Probase entity. Finally, we add Probase entities to the related ODP categories based on the semantic relevance.
V-A Concept Representation of Probase Entity
The second part of the methodology (refer to 2⃝ in Figure 1) is to represent Probase entities with concepts. Given a specific Probase entity, we obtain a list of related concepts, which are already provided in Probase. For example, the related concepts of the Probase entity “Galaxy Nexus” include “smartphone”, “product”, and “multi_touch_phone”. Among these concepts, however, humans are less likely to associate the entity “Galaxy Nexus” with general concepts such as “product”, because such concepts are used with many entities. Likewise, overly specific concepts such as “multi_touch_phone” cannot be regarded as representative, because they are not frequently used with entities.
To assign appropriate concepts to entities, many researchers rely on a probabilistic approach called typicality. In this paper, we use typicality as a score function; it assigns a high representativeness score to concepts that are neither too general nor too specific. We compute the typicality score of a Probase concept $c$ for a Probase entity $e$ as follows:

$$typicality(c, e) = P(e \mid c) \times P(c \mid e) = \frac{n(c, e)}{\sum_{e'} n(c, e')} \times \frac{n(c, e)}{\sum_{c'} n(c', e)}$$

where $n(c, e)$ is the co-occurrence frequency of concept $c$ and entity $e$, which is already provided in Probase. $P(e \mid c)$ is the typicality of entity $e$ in concept $c$, and it assigns higher scores to specific concepts of the entity $e$. In contrast, $P(c \mid e)$ is the typicality of concept $c$ for entity $e$, and it assigns higher scores to more general concepts of entity $e$. Thus, considering both $P(e \mid c)$ and $P(c \mid e)$ assigns a high representativeness score to concepts that are neither too general nor too specific. After scoring the typicality of all concepts of the entity $e$, we select the concepts whose typicality scores exceed a certain threshold, determined in our preliminary experiments.
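The typicality computation can be sketched from co-occurrence counts as follows; the count table here is toy data, not actual Probase frequencies.

```python
def typicality(cooccur, concept, entity):
    """typicality(c, e) = P(e | c) * P(c | e), estimated from
    co-occurrence counts n(c, e) between concepts and entities."""
    n_ce = cooccur.get((concept, entity), 0)
    if n_ce == 0:
        return 0.0
    concept_total = sum(v for (c, _), v in cooccur.items() if c == concept)
    entity_total = sum(v for (_, e), v in cooccur.items() if e == entity)
    return (n_ce / concept_total) * (n_ce / entity_total)

counts = {
    ("smartphone", "galaxy nexus"): 8,
    ("smartphone", "iphone"): 2,
    ("product", "galaxy nexus"): 2,
}
score = typicality(counts, "smartphone", "galaxy nexus")  # (8/10) * (8/10) = 0.64
```

A concept that monopolizes the entity's co-occurrences (and vice versa) scores near 1.0, while very general or very specific concepts are penalized by one of the two factors.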
V-B Relevance between ODP Category and Probase Entity
The last part of the methodology (refer to 3⃝ in Figure 1) is to measure the semantic relevance between an ODP category and a Probase entity by comparing their concept vectors. For example, the ODP category /Shopping/../Cellular_Phone and the Probase entity “Galaxy Nexus” are similar, since they share common concepts such as “phone” or “cellular_phone”.
Given an ODP category $o$ and a Probase entity $e$, we define the semantic relevance score between them as follows:

$$Rel(o, e) = \sum_{c_i \in C_o} \sum_{c_j \in C_e} \vec{v}_o(c_i) \cdot \vec{v}_e(c_j) \cdot sim(c_i, c_j)$$

where $C_o$ is the set of concepts in the concept vector of ODP category $o$, and $C_e$ is the set of concepts in the concept vector of Probase entity $e$. In addition, $sim(c_i, c_j)$ is the similarity score between two concepts $c_i$ and $c_j$, which is already provided in Probase. $\vec{v}_o$ is the concept vector of ODP category $o$ and $\vec{v}_e$ is the concept vector of Probase entity $e$.
Next, we measure the semantic relevance for all categories and re-scale these scores as probabilities using softmax. Based on these probabilities, we rank the ODP categories for each Probase entity and select the top-$k$ ODP categories as its related categories. Finally, we add the Probase entity to these related ODP categories (refer to 4⃝ in Figure 1).
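Steps 3⃝ and 4⃝ can be sketched together as follows. The pairwise similarity table stands in for Probase's concept similarity scores; defaulting identical concepts to similarity 1.0 and unknown pairs to 0.0 is our assumption for illustration.

```python
import math

def relevance(cat_vec, ent_vec, sim):
    """Weighted sum of pairwise concept similarities between a
    category's and an entity's concept vectors."""
    return sum(
        w1 * w2 * sim.get((c1, c2), 1.0 if c1 == c2 else 0.0)
        for c1, w1 in cat_vec.items()
        for c2, w2 in ent_vec.items()
    )

def top_k_categories(cat_vecs, ent_vec, sim, k=3):
    """Softmax-normalize relevance scores over all categories and
    return the k most probable ones."""
    scores = {c: relevance(v, ent_vec, sim) for c, v in cat_vecs.items()}
    z = sum(math.exp(s) for s in scores.values())
    probs = {c: math.exp(s) / z for c, s in scores.items()}
    return sorted(probs, key=probs.get, reverse=True)[:k]

categories = {
    "/Shopping/.../Cellular_Phones": {"phone": 1.0, "cellular_phone": 0.8},
    "/Science/.../Galaxies": {"galaxy": 1.0},
}
entity = {"phone": 0.9, "cellular_phone": 0.7}
print(top_k_categories(categories, entity, sim={}, k=1))
# ['/Shopping/.../Cellular_Phones']
```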
VI Performance Experiments
Table I shows statistics of the datasets and Table II shows the selected parameter values. These values are determined empirically.
VI-A1 ODP Dataset
We use the ODP RDF dump, released in October 2014, from the original ODP dataset. It contains 796,902 ODP categories and 3,917,043 web pages. To obtain a well-organized taxonomy, we apply the heuristic rules described in prior work. As a result, we use 4,521 categories to build the taxonomy.
VI-A2 Probase Dataset
We use the Probase dataset, released in July 2013, to enrich the ODP-based text classification. It contains 6,215,858 entities and 2,359,856 concepts. To utilize only representative concepts for Probase entities, we choose the concepts whose typicality scores exceed 0.004, based on our preliminary experiments. Then, we select the Probase entities that have at least one representative concept. As a result, 1,000,500 Probase entities are used for enriching the semantics of ODP categories.
VI-A3 Probase Entity Dataset
We use the Probase entity dataset to evaluate the matching performance between ODP categories and entities. This evaluation shows that our proposed method helps to enrich ODP categories with semantically related entities. The dataset consists of 115 entities randomly selected from Probase, covering different topics such as fashion, movies, sports, and health.
VI-A4 News Dataset
We use a New York Times (NYT) news dataset to evaluate the classification performance on a real-world dataset. We select five news categories: art, business, fashion, movie, and sports. Then, we randomly collect approximately 20 news articles from these categories, where each news article includes at least one Probase entity. We use precision at $k$ as an evaluation metric, in the same fashion as prior work.
VI-B Experimental Setup
For the assessment, three researchers manually assess the ODP categories produced by the ODP-based text classifiers according to three scales: relevant, somewhat relevant, and not relevant. We use precision at $k$ as the evaluation metric for both the Probase entity dataset and the news dataset. For each entity or news article, we manually annotate the top-$k$ categories selected by each method and measure the precision at each position $k$. For example, if three categories out of the top five are relevant, then the precision at five is 0.6.
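The metric can be sketched as:

```python
def precision_at_k(ranked_categories, relevant, k):
    """Fraction of the top-k ranked categories judged relevant."""
    top = ranked_categories[:k]
    return sum(1 for c in top if c in relevant) / k

# Three of the top five categories are relevant -> 0.6, as in the example above.
p = precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"}, k=5)
```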
We evaluate the performance of the following five models:
ODP: The ODP-based text classifier [1, 10]. It is the baseline for ODP-based text classification.
ODP + Wiki: The ODP-based text classification enriched with Wikipedia phrases and relevant hyperlinks [5]. It is the state-of-the-art method for ODP-based text classification.
ODP + Path: The ODP-based text classification enriched with Probase entities. This method obtains the concepts of each ODP category from its path information and represents ODP categories as path-based concept vectors.

ODP + TF: The ODP-based text classification enriched with Probase entities. This method obtains the concepts of each ODP category from the web pages in the category and represents ODP categories as tf-based concept vectors.

ODP + TF-IDF (proposed model): The ODP-based text classification enriched with Probase entities. This method obtains the concepts of each ODP category from the web pages in the category and represents ODP categories as tf-idf-based concept vectors.
VI-C Experimental Results
VI-C1 Parameter Setting
Table II shows the parameters of our methodology. We set the merge ratio to 0.7 to generate the merged concept vectors. We select the concepts whose typicality scores exceed 0.004, based on our preliminary experiments; if the typicality score of a concept is less than 0.004, the concept is regarded as too general or too specific.
We use the parameter $\lambda$ to determine the optimal ratio of Probase entities to ODP words when classifying text; the combined score is $\lambda \cdot$ (Probase entity weight) $+ \, (1 - \lambda) \cdot$ (ODP word weight). Figure 4 shows the classification performance for different values of $\lambda$. We find that the performance of text classification increases as $\lambda$ increases, up to 0.8; the curve reaches a peak at $\lambda = 0.8$. We observe that raising the importance of entities improves text classification performance. In addition, we find that when $\lambda = 1.0$, the performance decreases significantly. This suggests that both the ODP terms and the added Probase entities contribute substantially to ODP-based text classification. In the rest of the experiments, we therefore set $\lambda$ to 0.8.
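The interpolation controlled by this parameter can be sketched as follows; the score inputs are illustrative.

```python
def combined_score(entity_score, word_score, lam=0.8):
    """Interpolate the Probase-entity score and the ODP-word score;
    lam = 0.8 gave the best classification performance in the experiments."""
    return lam * entity_score + (1 - lam) * word_score

# Entities dominate the combined score, but word evidence still contributes.
s = combined_score(entity_score=1.0, word_score=0.5)  # 0.9
```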
Table III (excerpt): ODP + Wiki achieves 0.505, 0.460, and 0.455 at precision positions one, three, and five, respectively.
VI-C2 Results on Matching Performance
Table III shows the matching performance on the Probase entity dataset. In this experiment, we exclude the ODP method because it only utilizes the ODP knowledge base without adding entities. As a result, our proposed method ODP + TF-IDF outperforms ODP + Wiki by 17%, 26%, and 25% on precision at one, three, and five, respectively. This implies that our proposed method helps to enrich ODP categories with semantically related entities, since it captures the semantic relationship between an ODP category and a Probase entity very well.
In addition, compared with ODP + Path, ODP + TF achieves much better performance, which implies that increasing the diversity of concepts in each ODP category helps to represent ODP categories more semantically. Lastly, compared with ODP + TF, ODP + TF-IDF achieves much better performance, which demonstrates the effectiveness of adopting document frequency when representing ODP categories as concept vectors.
To demonstrate statistical significance, we also perform the t-test on the classification results. The dagger symbol (†) indicates a p-value < 0.05, while the double dagger symbol (‡) indicates a p-value < 0.01.
VI-C3 Results on News Dataset
Table IV shows the classification performance on the news dataset. The proposed method ODP + TF-IDF outperforms all the other methods, which shows that enriching the semantics of ODP categories with Probase entities facilitates large-scale text classification. More specifically, ODP + TF-IDF and ODP + Wiki perform better than ODP, which means that additional knowledge is useful for improving the performance of ODP-based text classification. Moreover, the proposed method ODP + TF-IDF performs better than ODP + Wiki, which means that enriching ODP categories with semantically related entities improves actual classification performance. We give a more detailed qualitative analysis in the next subsection.
VI-D Qualitative Analysis
Table V shows the matching results for the entity “Galaxy Nexus”. The bold ODP categories are those correctly matched with the entity. In the matching results of ODP + Wiki, there are two correctly matched ODP categories for “Galaxy Nexus”. In contrast, all ODP categories produced by ODP + TF-IDF are correctly matched with the entity. This shows that ODP + TF-IDF captures the semantic relationship between an ODP category and a Probase entity better than ODP + Wiki.
Table VI shows the classification results for the query “Galaxy Nexus Spec”, which includes the entity “Galaxy Nexus”. ODP misclassifies the text into a science-related ODP category because there is no information about the entity “Galaxy Nexus” in the ODP. In contrast, ODP + TF-IDF and ODP + Wiki correctly classify the text into phone-related categories, since they can understand the semantic meaning of “Galaxy Nexus”. Thus, adding entities to the relevant ODP categories facilitates text classification. In the case of the proposed method ODP + TF-IDF, there are more correctly classified categories compared with ODP + Wiki. This shows that our proposed model captures the semantic relationship between an ODP category and a Probase entity very well, which improves text classification performance.
Table V (excerpt): for “Galaxy Nexus”, ODP + Wiki ranks /Shopping/Consumer_Electronics/Communications/Wireless/Cellular_Phones first.
Table VI (excerpt): for “Galaxy Nexus Spec”, ODP ranks /Sports/Motorsports/Auto_Racing/Organizations/SCCA first, while ODP + Wiki ranks /Computers/Internet/Searching/Search_Engines first.
VII Related Work
Traditional text classification approaches have relied on machine learning techniques such as support vector machines and Naive Bayes with bag-of-words (BoW) features. Due to the limitations of the BoW approach, many approaches have been developed to enrich semantic information by leveraging search engines or external knowledge bases.
To obtain semantic information through search engines, several studies have expanded the input text using search engines for classification. This approach differs from our proposed method in that it enriches the semantics of the input text. However, this expansion may not be suitable for real-time applications because it is very time consuming and heavily dependent on the quality of the search engine.
In another line of work, several studies [5, 22] have enriched the semantics of the categories in a classifier using knowledge bases such as Wikipedia or WordNet. In particular, one study [5] has employed Wikipedia to enrich the semantics of ODP categories with phrases and hyperlinks for ODP-based text classification. However, the coverage of the additional semantic information is limited because the average number of Wikipedia phrases per ODP document is approximately 0.5. In addition, it might include Wikipedia phrases that have little relevance to the ODP category, such as the Wikipedia phrase “service plan” in the ODP category /Shopping/../Cellular_Phone.
VIII Conclusion

In this paper, we have sought to enrich the semantics of ODP categories with Probase entities to better understand the semantics of text. Our proposed scheme involves three tasks. First, we represent each ODP category and Probase entity with concept representations. Second, we measure the semantic relevance between an ODP category and a Probase entity. Finally, we enrich ODP categories with related Probase entities based on the measured semantic relevance. We have verified the superiority of our proposed methodology in large-scale text classification on a real-world dataset. We plan to apply the proposed methodology to real-world applications, including contextual advertising and mobile advertising.
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (number 2015R1A2A1A10052665).
-  J.-H. Lee, J. Ha, J.-Y. Jung, and S. Lee, “Semantic contextual advertising based on the open directory project,” ACM Transactions on the Web, vol. 7, no. 4, pp. 24:1–24:22, 2013.
-  W.-J. Ryu, J.-H. Lee, and S. Lee, “Utilizing verbal intent in semantic contextual advertising,” IEEE Intelligent Systems, vol. 32, no. 3, pp. 7–13, 2017.
-  P. A. Chirita, W. Nejdl, R. Paiu, and C. Kohlschütter, “Using odp metadata to personalize search,” in Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 178–185.
-  J. Ha, J.-H. Lee, K.-S. Shim, and S. Lee, “Eui: an embedded engine for understanding user intents from mobile devices,” in Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010, pp. 1935–1936.
-  H. Shin, G. Lee, W.-J. Ryu, and S. Lee, “Utilizing wikipedia knowledge in open directory project-based text classification,” in Proceedings of the 32nd Symposium on Applied Computing, 2017, pp. 309–314.
-  Y. Song, S. Liu, X. Liu, and H. Wang, “Automatic taxonomy construction from keywords via scalable bayesian rose trees,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp. 1861–1874, 2015.
-  J. Dalton, L. Dietz, and J. Allan, “Entity query feature expansion using knowledge base links,” in The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2014, pp. 365–374.
-  J. Pound, P. Mika, and H. Zaragoza, “Ad-hoc object retrieval in the web of data,” in Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 771–780.
-  W. Wu, H. Li, H. Wang, and K. Q. Zhu, “Probase: A probabilistic taxonomy for text understanding,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pp. 481–492.
-  J. Ha, J.-H. Lee, W.-J. Jang, Y.-K. Lee, and S. Lee, “Toward robust classification using the open directory project,” in Proceedings of the International Conference on Data Science and Advanced Analytics, 2014, pp. 607–612.
-  Y. Song, H. Wang, Z. Wang, H. Li, and W. Chen, “Short text conceptualization using a probabilistic knowledgebase,” in Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011, pp. 2330–2336.
-  Z. Wang, K. Zhao, H. Wang, X. Meng, and J.-R. Wen, “Query understanding through knowledge-based conceptualization.” in Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015, pp. 3264–3270.
-  F. Wang, Z. Wang, Z. Li, and J.-R. Wen, “Concept-based short text classification and ranking,” in Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 2014, pp. 1069–1078.
-  A. McCallum, R. Rosenfeld, T. M. Mitchell, and A. Y. Ng, “Improving text classification by shrinkage in a hierarchy of classes,” in Proceedings of the Fifteenth International Conference on Machine Learning ICML, 1998, pp. 359–367.
-  T. Lee, Z. Wang, H. Wang, and S.-w. Hwang, “Attribute extraction and scoring: A probabilistic approach,” in Proceedings of the 29th IEEE International Conference on Data Engineering, 2013, pp. 194–205.
-  Z. Wang, H. Wang, J.-R. Wen, and Y. Xiao, “An inference approach to basic level of categorization,” in Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 653–662.
-  T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” In Proceeding of the 10th European Conference on Machine Learning, pp. 137–142, 1998.
-  N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
-  D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, and Q. Yang, “Query enrichment for web-query classification,” ACM Transactions on Information Systems, vol. 24, no. 3, pp. 320–352, 2006.
-  A. Sun, “Short text classification using very few words,” in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012, pp. 1145–1146.
-  M. Chen, X. Jin, and D. Shen, “Short text classification improved by learning multi-granularity topics,” in Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011, pp. 1776–1781.
-  M. Rodriguez, J. Hidalgo, and B. Agudo, “Using wordnet to complement training information in text categorization,” in Proceedings of the International Conference on Recent Advances in Natural Language Processing, vol. 97, 2000, pp. 353–364.