Owing to the rapid growth of applications that produce and consume text, numerous tools and techniques have been introduced in the recent past for extracting useful patterns from unstructured text. These patterns are crucial for organizations to discover knowledge and make intelligent decisions. As the amount of such data grows exponentially, existing algorithms perform poorly in terms of scalability and performance. There are still many avenues where text data is yet to be fully exploited, and thus new and efficient algorithms are needed. Platforms such as social networks, e-commerce websites, blogs and research journals generate such data in the form of unstructured text, and it is essential to analyze, synthesize and process this data for efficient retrieval of useful information.
In text mining, concepts are defined as sequences of words that represent real or imaginary entities. Extraction of such entities is non-trivial for applications such as automated ontology generation [2] and aspect-oriented sentiment analysis, to name a few. In this era of data explosion, it is very difficult to store, process and manage data and, most importantly, to extract knowledge from it. To overcome this shortfall, a significant amount of research has been carried out in the recent past on leveraging the underlying thematic and semantic structure of text archives. As a result, a good number of algorithmic techniques have been introduced which have proved efficient for discovering the themes and semantics underlying high-dimensional data.
Topic models are a suite of text understanding algorithms that statistically uncover the latent themes pervading a large collection of unstructured text. Since their inception, text mining researchers and practitioners have used them extensively to analyze and organize large document collections. They are unsupervised learning algorithms and thus do not require a user-tagged corpus. A large number of topic modeling algorithms have been reported, differing in the assumptions they make for modeling topics. Probabilistic topic models, and Latent Dirichlet Allocation (LDA) in particular, are flavors of topic modeling that have attracted significant attention.
Contributions: This work proposes a novel unsupervised approach, guided by probabilistic topic modeling, for learning concept hierarchies from a large unstructured text corpus. To begin with, we model topics from the corpus using the Latent Dirichlet Allocation (LDA) algorithm and then use a lightweight linguistic process to identify concepts that are close to real-world understanding. We then use a subsumption relation ("is-a") to connect related concepts, thus forming a hierarchy of concepts.
Organization: The rest of this paper is organized as follows. Section 2 defines the problem and briefly reviews related work. Section 3 gives background on Latent Dirichlet Allocation. Section 4 introduces the proposed approach. Implementation details are presented in Section 5, and the evaluation of the proposed method is discussed in Section 6. Finally, we draw conclusions and discuss future work in Section 7.
2 Problem Definition and Related Work
2.1 Problem Definition
Here, we define the problem formally. Given a large corpus of unstructured text documents, our problem is to automatically generate concept hierarchies which are close to human understanding. In a nutshell, this paper aims to answer the following research questions:
Is it possible to automatically extract human-interpretable concepts from statistically generated topics using a lightweight linguistic process?
Can our proposed method learn a hierarchy of such concepts by incorporating a subsumption relation between them, a relation which is important in automated ontology generation?
Given a large but unstructured text corpus, can our topic modeling guided method extract concepts and learn concept hierarchies better than existing algorithms?
Many recent works in this direction have proposed algorithms to extract semantically rich concepts from plain text. In the following section, we acknowledge past literature that discusses methods close to our proposed algorithm.
Notations used in this paper: To aid the narrative, some notations commonly used in the rest of this paper are shown in Table 1.
|itf||inverse topic frequency|
|normalized term frequency|
2.2 Related Work
Concept extraction, the process of extracting real or imaginary entities from plain text, has gained wide recognition in the recent past. This is due to the wide variety of applications that deal mainly with text data, such as e-commerce websites and research articles. A significant body of literature is therefore available in the field of concept extraction and mining, proposing many algorithms with varying degrees of success. In this section, we focus on past literature on automated concept extraction and hierarchy learning algorithms and briefly discuss works closely related to our proposed framework.
Two notable works proposed algorithms for mining topical phrases from text documents. The former constructs a topic-word matrix before modeling topics, but the disadvantage of this approach is that creating such a matrix for a large volume of text is often difficult. The latter uses a two-stage process for modeling topics and mainly works with clinical documents: it first identifies phrases using off-the-shelf tools and then trains a topic model with the identified phrases. Another work which uses topic models for generating multi-word phrases is the topical n-gram model, which uses a switching variable to identify a new n-gram. This method assumes that the words within an n-gram usually do not share the same topic, which may not be true all the time.
Automatic Concept Extractor (ACE), proposed by Ramirez and Mattmann, is a system specifically designed for extracting concepts from HTML pages; it uses the text body and visual clues in HTML tags to identify potential concepts. Even though this method could outperform some state-of-the-art methods, its dependency on HTML was a major drawback. Turney proposed another system named GenEx, which employed a genetic algorithm supported rule learning mechanism for concept extraction.
Parameswaran et al. proposed a system which extracts concepts from user tag and query log datasets using techniques similar to association rule mining. This method uses features such as frequency of occurrence and popularity among users for extracting core concepts and attempts to build a web of concepts. Even though this algorithm can be applied to any large dataset, a lot of additional processing is required when dealing with web pages. Gelfand et al. proposed a bag-of-words approach for concept extraction from plain text and used the extracted concepts to form a closely tied semantic relations graph representing the relationships between them. They applied this technique to classification tasks and found that their method produces better concepts than the Naive Bayes text classifier.
Rajagopal et al. introduced another graph-based approach for commonsense concept extraction and detection of semantic similarity among those concepts. They used a manually labeled dataset of 200 multi-word concept pairs for evaluating their parser and showed that their method could effectively find syntactically and semantically related concepts. The main disadvantage of this method is its reliance on a manually labeled dataset, the creation of which is time-consuming and requires human effort. Another work reported in this domain is the method proposed by Krulwich and Burkey, which uses a simple heuristic rule-based approach to extract key phrases from documents by considering visual clues, such as the use of bold and italic characters, as features. They showed that this technique can be extended to automatic document classification experiments.
A key phrase extraction system called Automatic Keyphrase Extraction (KEA), developed by Witten et al., creates a Naive Bayes learning model from known key phrases extracted from training documents and uses this model to infer key phrases from a new set of documents. As an extension to the KEA framework, Song et al. proposed a method which uses the information gain measure for ranking candidate key phrases based on the distance and tf-idf features first introduced in KEA. Another impressive and widely used method, named the C/NC method, was introduced by Frantzi et al. and extracts multi-word terms from medical documents. The algorithm uses a POS tagger and a POS pattern filter for collecting noun phrases and then uses statistical measures for determining the termhood of candidate multi-words.
The method proposed in this paper is a hybrid approach that combines statistical methods, such as topic modeling and tf-itf weighting, with lightweight linguistic processes, such as POS tagging and analysis, for leveraging concepts from text. We expect the learnt concept hierarchy to be close to the real-world understanding of concepts, which we quantify using evaluation measures such as precision, recall and F-measure.
3 Background: Latent Dirichlet Allocation (LDA)
A good number of topic modeling algorithms have been introduced in the recent past, differing mainly in the assumptions they adopt for the statistical processing. Thomas Hofmann introduced an automated document indexing method based on a latent class model for factor analysis of count data in the latent semantic space. This generative data model, called Probabilistic Latent Semantic Indexing (PLSI), is considered an alternative to basic Latent Semantic Indexing and has a strong statistical foundation. The basic assumption of PLSI is that each word in a document corresponds to only one topic.
Later, Blei et al. introduced a new topic modeling algorithm known as Latent Dirichlet Allocation (LDA), which is more efficient and attractive than PLSI. This model assumes that a document contains multiple topics and that such topics are leveraged using a Dirichlet prior. In the following, we briefly describe the underlying principle of LDA. Even though LDA works well on a broad range of discrete datasets, text is considered a typical example to which the model can be best applied. The process by which LDA generates a document with words can be described as follows:
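Choose the number of words, $N$, according to a Poisson distribution;
Choose the distribution over topics, $\theta$, for this document from a Dirichlet distribution, $\theta \sim \mathrm{Dir}(\alpha)$;
For each of the $N$ words, choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$;
Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$.
Thus the marginal distribution of the document can be obtained from the above process as:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta \qquad (1)$$

where $p(\theta \mid \alpha)$ is derived from the Dirichlet distribution parameterized by $\alpha$, and $p(w_n \mid z_n, \beta)$ is the probability of $w_n$ under topic $z_n$ parameterized by $\beta$. The parameter $\alpha$ can be viewed as a prior observation count of the number of times each topic is sampled in a document, before we have actually seen any word from that document. The parameter $\beta$ is a hyper-parameter determining the number of times words are sampled from a topic, before any word of the corpus is observed. Finally, the probability of the whole corpus $D$ of $M$ documents can be derived by taking the product of the marginal probabilities of all documents, as given below:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta) \qquad (2)$$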
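The generative process above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than part of the proposed method: the toy vocabulary, the per-topic word distributions `beta`, the prior `alpha` and the Poisson rate `xi` are all assumed values chosen for readability.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's method: multiply uniform draws until the product drops below e^-lam."""
    threshold, k, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(rng, probs):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(rng, alpha, beta, vocab, xi=8.0):
    """N ~ Poisson(xi), theta ~ Dir(alpha), then a topic z_n and a word w_n per position."""
    n_words = sample_poisson(rng, xi)
    theta = sample_dirichlet(rng, alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(rng, theta)      # z_n ~ Multinomial(theta)
        w = sample_categorical(rng, beta[z])    # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words

# Toy setting (assumed): two topics over a four-word vocabulary.
vocab = ["web", "search", "music", "players"]
beta = [[0.5, 0.5, 0.0, 0.0],   # topic 0 emits only "web" / "search"
        [0.0, 0.0, 0.5, 0.5]]   # topic 1 emits only "music" / "players"
rng = random.Random(0)
theta, topics, words = generate_document(rng, alpha=[0.5, 0.5], beta=beta, vocab=vocab)
print(theta, list(zip(topics, words)))
```

Inference in LDA runs this process in reverse: given only the observed words, it estimates the topic proportions and the per-topic word distributions that most plausibly generated them.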
4 Proposed Approach
In the area of text mining, topic models, and probabilistic topic models in particular, are a suite of algorithms which have gained wide recognition for their ability to leverage hidden thematic information from huge archives of text data. Text mining researchers make extensive use of topic modeling algorithms such as Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Indexing (pLSI) and Latent Dirichlet Allocation (LDA) for bringing out the themes, or so-called "topics", from high-dimensional unstructured data.
Among these algorithms, LDA has received a lot of attention in the recent past and is widely used because of its ease of implementation and potential applications. Even though LDA has been used extensively for leveraging topics, very few studies have been reported on mapping these statistically generated topics to semantically rich concepts. Our proposed framework attempts to address this issue: it uses the LDA algorithm to generate topics and then leverages concepts from those topics using a new statistical weighting scheme and some lightweight linguistic processes. The overall workflow of the proposed approach is depicted in Fig. 1.
Our framework can be divided into two modules: (i) concept extraction and (ii) concept hierarchy learning. The concept extraction module extracts concepts from the topics generated by the LDA algorithm, and the concept hierarchy learning module learns a hierarchy of the extracted concepts using a subsumption hierarchy learning algorithm. A detailed explanation of these modules is given below.
4.1 Concept Extraction
In this module, we introduce a topic-to-concept mapping procedure for leveraging potential concepts from the statistically computed topics generated by the LDA algorithm. The first step of the proposed framework is preprocessing, which removes unwanted, irrelevant and noisy data. The Latent Dirichlet Allocation algorithm is then executed on the preprocessed data to generate topics. A total of 50 topics have been extracted by tuning the parameters of the LDA algorithm. Once the topics are obtained, for each topic we create a topic-document cluster by grouping the documents which generated that topic, and the same process is repeated for all topics under consideration.
Now, we introduce a new weighting scheme called tf-itf (term frequency - inverse topic frequency), which is used to find the highly contributing topic words in each topic. We use this weighting scheme to filter out the relevant candidate topic words. Term frequency (tf) is the total number of times a particular topic word occurs in the topic-document cluster. The normalized term frequency of a topic word is calculated as:
The inverse topic frequency (itf) is calculated as:
The tf-itf weight is calculated using the following equation:
This step is followed by a sentence extraction process in which all sentences containing topic words with a high tf-itf weight are extracted. Next, we apply part-of-speech tagging to these sentences and keep only the words tagged as nouns and adjectives, as we are concentrating only on the extraction of concepts. In the linguistic pre-processing step, we take the Noun + Noun, Noun + Adjective and (Adjective / Noun) + Noun combinations of words from the tagged collection. Concept identification is the last step in the process flow, in which we find the term count of all these combinations. A positive term count implies that the multi-word term can be a potential "concept"; a multi-word term with a zero term count is ignored. The newly proposed algorithm for extracting the concepts is shown in Algorithm 1.
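The pattern filtering and term-count check described above might look like the following sketch. It assumes the sentences have already been tagged with Penn Treebank tags (for instance by NLTK's `pos_tag`), and it counts a candidate by plain substring matching against the lower-cased corpus, which is a simplification. The function names and the toy sentence are illustrative, not taken from Algorithm 1.

```python
from itertools import permutations

NOUN_TAGS = {"NN", "NNS", "NNP"}
ADJ_TAGS = {"JJ"}

def matches_pattern(tag1, tag2):
    """True for Noun + Noun, Noun + Adjective and Adjective + Noun."""
    if tag1 in NOUN_TAGS:
        return tag2 in NOUN_TAGS or tag2 in ADJ_TAGS
    return tag1 in ADJ_TAGS and tag2 in NOUN_TAGS

def candidate_phrases(tagged_sentence):
    """Keep nouns and adjectives, then form every ordered pair that fits a pattern."""
    kept = [(w.lower(), t) for w, t in tagged_sentence
            if t in NOUN_TAGS or t in ADJ_TAGS]
    return {f"{w1} {w2}"
            for (w1, t1), (w2, t2) in permutations(kept, 2)
            if matches_pattern(t1, t2)}

def identify_concepts(tagged_sentences, corpus_text):
    """Return the candidates whose term count in the corpus is positive."""
    corpus = corpus_text.lower()
    concepts = {}
    for sentence in tagged_sentences:
        for phrase in candidate_phrases(sentence):
            count = corpus.count(phrase)   # substring count (simplification)
            if count > 0:
                concepts[phrase] = count
    return concepts

# Toy input (assumed): one pre-tagged sentence and a two-sentence corpus.
corpus_text = "Google launched a new search engine. The search engine indexes web pages."
tagged = [[("Google", "NNP"), ("launched", "VBD"), ("a", "DT"),
           ("new", "JJ"), ("search", "NN"), ("engine", "NN")]]
concepts = identify_concepts(tagged, corpus_text)
print(concepts)
```

Candidates such as "engine search" or "google engine" are generated by the combination step but never occur in the corpus, so the zero term count removes them.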
4.2 Concept Hierarchy Learning
In this module, we derive a hierarchical organization of the extracted concepts using a type of co-occurrence called the "subsumption" relation. The subsumption relation is a simple but very effective way of inferring relationships between words and phrases without using any training data or clustering methods. The basic idea is simple: for any two concepts $C_x$ and $C_y$, $C_x$ is said to subsume $C_y$ if two conditions hold: $P(C_x \mid C_y) = 1$ and $P(C_y \mid C_x) < 1$. To be more specific, $C_x$ subsumes $C_y$ if the documents in which $C_y$ occurs are a subset of the documents in which $C_x$ occurs. Because $C_x$ subsumes $C_y$ and is more frequent, $C_x$ is the parent of $C_y$ in the hierarchy.
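As a minimal sketch of the subsumption test, each concept can be represented by the set of identifiers of the documents in which it occurs, so that both conditional probabilities reduce to set arithmetic. The function names and the document identifiers in the example are assumptions made for illustration.

```python
def subsumes(docs_x, docs_y):
    """x subsumes y when P(x|y) = 1 and P(y|x) < 1 over document sets."""
    if not docs_x or not docs_y:
        return False
    shared = len(docs_x & docs_y)
    p_x_given_y = shared / len(docs_y)
    p_y_given_x = shared / len(docs_x)
    return p_x_given_y == 1.0 and p_y_given_x < 1.0

def build_hierarchy(concept_docs):
    """Return sorted (parent, child) edges for every subsuming pair of concepts."""
    edges = []
    for parent, parent_docs in concept_docs.items():
        for child, child_docs in concept_docs.items():
            if parent != child and subsumes(parent_docs, child_docs):
                edges.append((parent, child))
    return sorted(edges)

# Toy document sets (assumed identifiers).
concept_docs = {
    "network connection": {1, 2, 3, 4},
    "dial-up internet": {2, 3},
    "spam mail": {5, 6},
}
edges = build_hierarchy(concept_docs)
print(edges)
```

Requiring the first probability to be exactly 1 means a single document that mentions the child without the parent breaks the relation, so on noisy corpora a threshold slightly below 1 may be worth considering.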
5 Experimental Setup
This section presents the implementation details of our proposed framework; the concept extraction and hierarchy learning procedures are discussed in detail.
5.1 Concept Extraction
Here, the concept extraction module of the framework is discussed. This module covers data collection and pre-processing, topic modeling, topic-document clustering, tf-itf weighting, sentence extraction and POS tagging, and linguistic processing for identifying concepts. A detailed explanation of each step is given below.
5.1.1 Dataset Collection and Pre-processing
We used the Reuters Corpus Volume 1 (RCV1) and BBC news datasets for the experiment. Reuters is the world's biggest international news agency and delivers news and related information through its website, video, interactive television and mobile platforms. Reuters Corpus Volume 1 is in XML format and is freely available for research purposes. We extracted the text by thorough pre-processing, such as removing XML tags, URLs and other special symbols, and created a new dataset exclusively for our experiment. BBC provides two benchmark news article datasets which are freely available for machine learning research. The general BBC dataset consists of 2225 text documents taken directly from the BBC website, corresponding to stories in five areas (business, entertainment, politics, sports and technology) from 2004 to 2005. We applied thorough pre-processing to this dataset, including stemming and the removal of stop-words, URLs and special characters, and made an experiment-ready copy of the original dataset.
5.1.2 Topic Modeling
The Latent Dirichlet Allocation (LDA) algorithm has been applied to the pre-processed dataset to leverage topics for this experiment. The number of iterations is set to 300, as the Gibbs sampling method usually approaches the target distribution after 300 iterations. The number of topics is set to 50, and a snapshot of 5 randomly chosen topics is shown in Table 2.
| Topic 1 | Topic 2 | Topic 3 | Topic 4 |
|---|---|---|---|
| web [0.0048] | system [0.0064] | set [0.0047] | site [0.0042] |
| search [0.0048] | music [0.0045] | software [0.0032] | net [0.0038] |
| online [0.0047] | devices [0.0043] | virus [0.0028] | spam [0.0035] |
| news [0.0046] | players [0.0035] | users [0.0027] | mail [0.0028] |
| google [0.0033] | media [0.0032] | firms [0.0025] | firm [0.0027] |
| people [0.0032] | digital [0.0027] | microsoft [0.0025] | data [0.0024] |
| information [0.0032] | market [0.0024] | security [0.0022] | attacks [0.0019] |
| internet [0.0029] | technology [0.0022] | windows [0.0022] | network [0.0018] |
| website [0.0027] | consumer [0.0021] | file [0.0013] | web [0.0016] |
| users [0.0020] | technologies [0.0018] | programs [0.0011] | research [0.0014] |
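To make the Gibbs sampling step concrete, the following is a minimal collapsed Gibbs sampler for LDA in plain Python. It is a sketch under stated assumptions, not the implementation used in the experiments: the symmetric priors `alpha` and `beta`, the toy documents and the reduced iteration count in the example are all illustrative.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=300, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(n_topics)]   # tokens of word v in topic k
    topic_total = [0] * n_topics                      # tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, k = word_id[w], assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight of topic t: (n_dt + alpha) * (n_tv + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimate of each topic's distribution over the vocabulary.
    phi = [[(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
           for k in range(n_topics)]
    return vocab, phi, doc_topic

# Toy corpus (assumed) with two obvious themes.
docs = [["web", "search", "web", "google"],
        ["music", "players", "music", "media"],
        ["web", "google", "search"],
        ["media", "music", "players"]]
vocab, phi, doc_topic = lda_gibbs(docs, n_topics=2, n_iter=50)
```

The returned `doc_topic` counts are what a later step can use to decide which documents contributed most to each topic.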
5.1.3 Topic - Document Clustering
In this step, for each topic we group the top 50 documents that contributed to the creation of that topic. This has been done for all 50 topics. As an outcome, we obtain 50 clusters, each containing the documents that generated the corresponding topic.
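One way to form these clusters, sketched below, is to rank documents by their weight for each topic and keep the top ones. Using the document-topic weight as the measure of "contribution" is an assumption, as are the function name and the toy matrix.

```python
def topic_document_clusters(doc_topic, top_n=50):
    """For each topic, return the indices of the top_n highest-weighted documents."""
    n_topics = len(doc_topic[0])
    clusters = {}
    for k in range(n_topics):
        ranked = sorted(range(len(doc_topic)),
                        key=lambda d: doc_topic[d][k],
                        reverse=True)
        clusters[k] = ranked[:top_n]
    return clusters

# Toy document-topic weights (assumed): 4 documents, 2 topics.
doc_topic = [[0.9, 0.1],
             [0.2, 0.8],
             [0.6, 0.4],
             [0.1, 0.9]]
clusters = topic_document_clusters(doc_topic, top_n=2)
print(clusters)
```

With this scheme a document can belong to more than one cluster when it carries substantial weight for several topics.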
5.1.4 TF-ITF Weighting
Here, we compute the weight of each word in every topic using Eq.(3), Eq.(4) and Eq.(5) to find out highly used topic words in the collection. Table 2 also shows topic words along with their tf-itf weight.
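A weighting of this kind can be sketched by analogy with tf-idf. The forms used below are assumptions and may differ from Eq. (3) to Eq. (5): the normalized term frequency is taken as the count of a topic word in its topic-document cluster divided by the total count of that topic's words in the cluster, and the inverse topic frequency as the logarithm of the number of topics divided by the number of topics whose word list contains the word.

```python
import math
from collections import Counter

def tf_itf(topic_words, cluster_tokens):
    """topic_words: {topic: [words]}; cluster_tokens: {topic: tokens of its cluster}."""
    n_topics = len(topic_words)
    # Number of topics whose word list contains each word.
    topics_containing = Counter(
        w for words in topic_words.values() for w in set(words)
    )
    weights = {}
    for topic, words in topic_words.items():
        counts = Counter(cluster_tokens[topic])
        total = sum(counts[w] for w in words) or 1   # avoid division by zero
        weights[topic] = {
            w: (counts[w] / total) * math.log(n_topics / topics_containing[w])
            for w in words
        }
    return weights

# Toy input (assumed): two topics that share the word "users".
topic_words = {1: ["web", "search", "users"], 3: ["software", "virus", "users"]}
cluster_tokens = {
    1: "web search web users search web".split(),
    3: "software virus users software".split(),
}
weights = tf_itf(topic_words, cluster_tokens)
print(weights[1])
```

Under these assumed forms, a word that appears in every topic receives a weight of zero, which is the intended effect of the inverse topic frequency: words shared across topics say little about any one of them.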
| Concepts_Topic 1 | Concepts_Topic 2 | Concepts_Topic 3 | Concepts_Topic 4 |
|---|---|---|---|
| web search | software users | music players | spam mail |
| search engine | virus programs | digital media | spam website |
| google news | windows security | digital technology | network research |
| online news search | software forms | consumer devices | research firm |
| google search engine | microsoft programs | market system | website attacks |
5.1.5 Sentence Extraction & POS Tagging
In the sentence extraction step, we consider the topic words with the highest tf-itf weights and extract the sentences containing these topic words from the topic-document clusters. Part-of-speech tagging is then applied to these sentences to identify words tagged as nouns and adjectives, as our aim is to extract potential "concepts" from the repository. For this experiment, the Natural Language Toolkit (NLTK), a suite of Natural Language Processing libraries for the Python programming language, has been used.
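The sentence extraction step can be sketched with the standard library alone. The regular-expression sentence splitter below is a deliberately naive stand-in (an assumption); in practice NLTK's `sent_tokenize` and `pos_tag` could take over splitting and tagging once their language models are installed. The function name and example documents are illustrative.

```python
import re

def extract_sentences(documents, topic_words):
    """Return the sentences that mention at least one of the given topic words."""
    wanted = {w.lower() for w in topic_words}
    selected = []
    for text in documents:
        # Naive splitter: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            tokens = set(re.findall(r"[a-z0-9'-]+", sentence.lower()))
            if tokens & wanted:
                selected.append(sentence)
    return selected

# Toy cluster (assumed) and two high-weight topic words.
documents = ["Spam mail is rising. The weather was fine.",
             "Users install security software."]
sentences = extract_sentences(documents, ["spam", "software"])
print(sentences)
```

Matching is done on whole lower-cased tokens, so "spam" does not match "spammers"; if the corpus was stemmed during pre-processing, the topic words and the sentence tokens would need the same treatment.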
5.1.6 Linguistic Processing & Concept Identification
All words tagged as nouns (NN/NNP/NNS) and adjectives (JJ) are selected, and all possible Noun + Noun, Noun + Adjective and (Adjective / Noun) + Noun combinations are generated. The results are shown in Table 3. The term count of each of these multi-word terms is then calculated against the original corpus. A positive term count implies that the corresponding multi-word term can be a potential concept, and we eliminate the term if its term count is zero. This process has been repeated for all the multi-word terms generated.
5.2 Concept Hierarchy Learning
The concept hierarchy learning module concentrates on leveraging a subsumption hierarchy depicting an "is-a" relation between the concepts identified by the proposed algorithm. The subsumption relation is simple but is considered an important relationship type in any ontological structure, and we calculate two conditional probabilities to establish it. For any two concepts $C_x$ and $C_y$, we first calculate $P(C_x \mid C_y)$ and then $P(C_y \mid C_x)$; to establish a subsumption relation, the former probability must be 1 and the latter must be less than 1. In other words, $C_x$ subsumes $C_y$ if the documents in which $C_y$ occurs are a subset of the documents in which $C_x$ occurs.
For instance, consider two concepts, dial-up internet and network connection. The proposed method computes the two conditional probabilities and finds that the documents in which dial-up internet occurs are a subset of the documents in which network connection occurs. That means there exists a subsumption relation between these two concepts, and dial-up internet may be subsumed by network connection. This process has been repeated for all concepts in the collection, and a part of the hierarchy generated using our proposed algorithm is shown in Fig. 2.
6 Evaluation of Results
Here we evaluate the results produced by our proposed method, using precision and recall to measure the quality of the extracted concepts. We first created a human-generated concept repository against which the machine-generated concepts are verified. Precision is the fraction of machine-extracted concepts that are also human-generated, and recall is the fraction of human-authored concepts that are also extracted by the proposed algorithm. In information retrieval, achieving high precision and high recall at the same time is generally difficult, and the F1 measure is used to balance the two. Here, a true positive is a concept that appears both among the human-generated concepts and among the concepts extracted by our proposed algorithm, a false positive is an extracted concept that is not a human-authored concept, and a false negative is a human-authored concept that is missed by the concept extraction method. Using these measures, we have compared our proposed method against some of the existing concept extraction algorithms, and the result is shown in Table 4.
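The three measures follow directly from the true positive, false positive and false negative counts defined above. The sketch below treats both concept collections as sets of strings; the example sets are made up for illustration and are not experimental data.

```python
def evaluate(extracted, gold):
    """Return (precision, recall, F1) of extracted concepts against a gold set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)    # extracted and human-authored
    fp = len(extracted - gold)    # extracted but not human-authored
    fn = len(gold - extracted)    # human-authored but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative sets only.
extracted = {"web search", "search engine", "market system", "spam mail"}
gold = {"web search", "search engine", "spam mail", "digital media", "google news"}
precision, recall, f1 = evaluate(extracted, gold)
print(precision, recall, f1)
```

Exact string matching is strict: "search engines" and "search engine" count as different concepts, so some normalization (lower-casing, stemming) before comparison may be appropriate.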
From the performance graph shown in Figure 4, it is clear that our proposed algorithm extracts more concepts as the number of topics increases. Baseline algorithms such as ACE and ICE perform poorly when the number of topics is increased randomly. This shows that the proposed algorithm outperforms the baseline algorithms when extracting real-world concepts from a large number of statistically generated topics.
7 Conclusions and Future Work
This paper proposed a novel framework, guided by a probabilistic topic modeling algorithm, for extracting close to real-world concepts from a large collection of unstructured text documents. The proposed method also learns a subsumption hierarchy which exploits the "is-a" relationships among the identified concepts, a relation used extensively in ontology generation. Experiments conducted on large datasets such as the Reuters and BBC news corpora show that the proposed method outperforms some of the existing algorithms and that better concept identification is possible with this framework.
Encouraged by these promising results, we intend to measure the scalability of the proposed framework on larger datasets. Apart from the basic subsumption hierarchy depicting the "is-a" relation, our future work will focus on leveraging other relations that exist between concepts, so that this framework can automate the complete ontology generation process.
References
- [1] Pospiech S, Pelke M and Mertens R. Semi-automated ontology creation for semantic search in business process exploration. In IEEE Tenth International Conference on Semantic Computing (ICSC), 2016.
- [2] Marujo L, et al. Exploring events and distributed representations of text in multi-document summarization. Knowledge-Based Systems, 94:33-42, 2016.
- [3] Manek AS, Shenoy PD, Mohan MC and Venugopal KR. Aspect term extraction for sentiment analysis in large movie reviews using Gini Index feature selection method and SVM classifier. World Wide Web, 1-20, 2016.
- [4] Steyvers M and Griffiths T. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424-40, 2007.
- [5] Sanderson M and Croft B. Deriving concept hierarchies from text. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 206-213, 1999.
- [6] Lindsey RV, Headden III WP and Stipicevic MJ. A phrase-discovering topic model using hierarchical Pitman-Yor processes. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 214-222, 2012.
- [7] El-Kishky A, Song Y, Wang C, Voss CR and Han J. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3):305-16, 2014.
- [8] Wang X, McCallum A and Wei X. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 697-702, 2007.
- [9] Ramirez PM and Mattmann CA. ACE: improving search engines via Automatic Concept Extraction. In Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration (IRI 2004), pp. 229-234, 2004.
- [10] Turney PD. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-36, 2000.
- [11] Parameswaran A, Garcia-Molina H and Rajaraman A. Towards the web of concepts: Extracting concepts from large datasets. Proceedings of the VLDB Endowment, 3(1-2):566-77, 2010.
- [12] Gelfand B, Wulfekuler M and Punch WF. Automated concept extraction from plain text. In AAAI 1998 Workshop on Text Categorization, pp. 13-17, 1998.
- [13] Rajagopal D, Cambria E, Olsher D and Kwok K. A graph-based approach to commonsense concept extraction and semantic similarity detection. In Proceedings of the 22nd international conference on World Wide Web companion, pp. 565-570, 2013.
- [14] Krulwich B and Burkey C. Learning user information interests through extraction of semantically significant phrases. In Proceedings of the AAAI spring symposium on machine learning in information access, pp. 100-112, 1996.
- [15] Witten IH, Paynter GW, Frank E, Gutwin C and Nevill-Manning CG. KEA: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries, pp. 254-255, 1999.
- [16] Song M, Song IY and Hu X. KPSpotter: a flexible information gain-based keyphrase extraction system. In Proceedings of the 5th ACM international workshop on Web information and data management, pp. 50-53, 2003.
- [17] Frantzi K, Ananiadou S and Mima H. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2):115-30, 2000.
- [18] Hofmann T. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57, 1999.
- [19] Blei DM, Ng AY and Jordan MI. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
- [20] Dumais ST. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1):188-230, 2004.
- [21] Hofmann T. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57, 1999.
- [22] Blei DM, Ng AY and Jordan MI. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
- [23] Bird S. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pp. 69-72, 2006.
- [24] Lewis D, Yang Y, Rose T and Li F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
- [25] Greene D and Cunningham P. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd international conference on Machine learning, pp. 377-384, 2006.