Unsupervised Terminological Ontology Learning based on Hierarchical Topic Modeling

08/29/2017 · by Xiaofeng Zhu, et al. · Intel, Northwestern University

In this paper, we present hierarchical relation-based latent Dirichlet allocation (hrLDA), a data-driven hierarchical topic model for extracting terminological ontologies from a large number of heterogeneous documents. In contrast to traditional topic models, hrLDA relies on noun phrases instead of unigrams, considers syntax and document structures, and enriches topic hierarchies with topic relations. Through a series of experiments, we demonstrate the superiority of hrLDA over existing topic models, especially for building hierarchies. Furthermore, we illustrate the robustness of hrLDA in the settings of noisy data sets, which are likely to occur in many practical scenarios. Our ontology evaluation results show that ontologies extracted from hrLDA are very competitive with the ontologies created by domain experts.




I Introduction

Although researchers have made significant progress on knowledge acquisition and have proposed many ontologies, for instance, WordNet [1], DBpedia [2], YAGO [3], Freebase [4], NELL [5], DeepDive [6], Domain Cartridge [7], Knowledge Vault [8], INS-ES [9], iDLER [10], and TransE-NMM [11], current ontology construction methods still rely heavily on manual parsing and existing knowledge bases. This raises challenges for learning ontologies in new domains. While a strong ontology parser is effective on small-scale corpora, an unsupervised model is beneficial for learning new entities and their relations from new data sources, and is likely to perform better on larger corpora.

In this paper, we focus on unsupervised terminological ontology learning and formalize a terminological ontology as a hierarchical structure of subject-verb-object triplets. We divide a terminological ontology into two components: topic hierarchies and topic relations. Topics are presented in a tree structure where each node is a topic label (noun phrase), the root node represents the most general topic, the leaf nodes represent the most specific topics, and every topic is composed of its topic label and its descendant topic labels. Topic hierarchies are preserved in topic paths, and a topic path connects a list of topic labels from the root to a leaf. Topic relations are semantic relationships between any two topics or properties used to describe one topic. Figure 1 depicts an example of a terminological ontology learned from a corpus about European cities. We extract terminological ontologies by applying unsupervised hierarchical topic modeling and relation extraction to plain text.

Fig. 1: A representation of a terminological ontology. (Left: topic hierarchies) Each topic is composed of its topic label and its descendant topic labels, and topic paths run from the root to the leaves. (Right: topic relations) Every topic label has relations to itself and/or with other labels, e.g., a relation/property of a single topic, or a relation of one topic to another.

Topic modeling was originally used for topic extraction and document clustering. The classical topic model, latent Dirichlet allocation (LDA) [12], simplifies a document as a bag of its words and describes a topic as a distribution of words. Prior research [13, 14, 15, 16, 17, 18, 19] has shown that LDA-based approaches are adequate for (terminological) ontology learning. However, these models are deficient in that they still need human supervision to decide the number of topics, and to pick meaningful topic labels usually from a list of unigrams. Among models not using unigrams, LDA-based Global Similarity Hierarchy Learning (LDA+GSHL) [14] only extracts a subset of relations: “broader” and “related” relations. In addition, the topic hierarchies of KB-LDA [18] rely on hypernym-hyponym pairs capturing only a subset of hierarchies.

Considering the shortcomings of the existing methods, the main objectives of applying topic modeling to ontology learning are threefold.

  1. In topic models, a topic is usually represented with a list of unigrams. In a terminological ontology, a topic/entity needs to be represented with a more descriptive identifier (i.e., noun phrase). Currently, the number of topics is usually a fixed parameter, which restricts the number of classes an ontology could have. For instance, it is difficult to add a new species to an animal ontology.

  2. Both relations among different noun phrases and relations/properties (see the relations in Figure 1) for describing single noun phrases should be captured during the topic generation process.

  3. Hierarchies need to be built on topical affiliations. If topic A is a sub-topic of topic B, A has a more specific meaning than B. The depth of each topic path should be determined by a data-driven method.

To achieve the first objective, we extract noun phrases and then propose a sampling method to estimate the number of topics. For the second objective, we use language parsing and relation extraction to learn relations for the noun phrases. Regarding the third objective, we adapt and improve the hierarchical latent Dirichlet allocation (hLDA) model [20, 21]. hLDA is not ideal for ontology learning because it builds topics from unigrams (which are not descriptive enough to serve as entities in ontologies) and its topics may contain words from multiple domains when the input data include documents from many domains (see Section II and Figure 9). Our model, hrLDA, overcomes these deficiencies. In particular, hrLDA represents topics with noun phrases, uses syntax and document structures such as paragraph indentations and item lists, assigns multiple topic paths to every document, and allows topic trees to grow vertically and horizontally.

The primary contributions of this work can be specified as follows.

  • We develop a hierarchical topic model, hrLDA, that does not require one to set the topic number at every level of a topic tree or to set the topic path lengths from the root to leaves.

  • We integrate relation extraction into topic modeling leading to lower perplexity.

  • We propose a multiple topic path drawing strategy, which is an improvement over the simple topic path drawing method proposed in hLDA.

  • We present automatic extraction of terminological ontologies via hrLDA.

The rest of this paper is organized into five parts. In Section 2, we provide a brief background of hLDA. In Section 3, we present our hrLDA model and the ontology generation method. In Section 4, we demonstrate empirical results regarding topic hierarchies and generated terminological ontologies. Finally, in Section 5, we present some concluding remarks and discuss avenues for future work and improvements.

II Background

In this section, we introduce our main baseline model, hierarchical latent Dirichlet allocation (hLDA), and some of its extensions. We start from the components of hLDA, latent Dirichlet allocation (LDA) and the Chinese Restaurant Process (CRP), and then explain why hLDA needs improvements in both building hierarchies and drawing topic paths.

LDA is a three-level Bayesian model in which each document is a composite of multiple topics, and every topic is a distribution over words. Due to the lack of determinative information, LDA is unable to distinguish different instances containing the same content words (e.g., “I trimmed my polished nails” and “I have just hammered many rusty nails”). In addition, in LDA all words are probabilistically independent and equally important. This is problematic because different words and sentence elements should contribute differently to topic generation. For instance, articles contribute little compared to nouns, and sentence subjects normally contain the main topics of a document.

Introduced in hLDA, CRP partitions words into several topics by mimicking a process in which customers sit down in a Chinese restaurant with an infinite number of tables and an infinite number of seats per table. Customers enter one by one, with each new customer choosing to sit at an occupied table or a new table. The probability of a new customer sitting at the table with the largest number of customers is the highest. In reality, however, customers do not always join the largest table but prefer to dine with their acquaintances. The theory of distance-dependent CRP was formally proposed by David Blei [22]. In Section III-C we provide an explicit formula for topic partition, given that adjacent words and sentences tend to address the same topics.
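The standard CRP seating rule can be sketched as follows; the concentration parameter `gamma` and the function name are our notation, not the paper's:

```python
def crp_probs(table_sizes, gamma):
    """Seating probabilities for a new customer under the standard CRP.

    An existing table is chosen with probability proportional to its size;
    a new table is opened with probability proportional to gamma. The
    largest table therefore always carries the highest single-table
    probability, which is exactly the behavior ACRP revises.
    """
    n = sum(table_sizes)
    probs = [size / (n + gamma) for size in table_sizes]
    probs.append(gamma / (n + gamma))  # probability of opening a new table
    return probs
```

With tables of sizes 5, 2, and 1 and gamma = 1, the largest table receives probability 5/9, illustrating the rich-get-richer effect the text describes.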

hLDA combines LDA with CRP by setting one topic path with fixed depth for each document. The hierarchical relationships among nodes in the same path depend on an L-dimensional Dirichlet distribution that arranges the probabilities of topics being on different topic levels. Although the single path was changed to multiple paths in some extensions of hLDA, namely the nested Chinese restaurant franchise process [23] and the nested hierarchical Dirichlet process [24], this topic path drawing strategy puts words from different domains into one topic when the input data mix topics from multiple domains. This means that if a corpus contains documents in four different domains, hLDA is likely to include words from all four domains in every topic (see Figure 9). In light of the inadequacies discussed above, we propose a relation-based model, hrLDA. hrLDA incorporates semantic topic modeling with relation extraction to integrate syntax, and has the capacity to provide comprehensive hierarchies even in corpora containing mixed topics.

III Hierarchical Relation-based Latent Dirichlet Allocation

The main problem we address in this section is generating terminological ontologies in an unsupervised fashion. The fundamental concept of hrLDA is as follows. When people construct a document, they start with selecting several topics. Then, they choose some noun phrases as subjects for each topic. Next, for each subject they come up with relation triplets to describe this subject or its relationships with other subjects. Finally, they connect the subject phrases and relation triplets to sentences via reasonable grammar. The main topic is normally described with the most important relation triplets. Sentences in one paragraph, especially adjacent sentences, are likely to express the same topic.

We begin by describing the process of reconstructing LDA. Subsequently, we explain relation extraction from heterogeneous documents. Next, we propose an improved topic partition method over CRP. Finally, we demonstrate how to build topic hierarchies that bind with extracted relation triplets.

III-A Relation-based Latent Dirichlet Allocation

Documents are typically composed of chunks of texts, which may be referred to as sections in Word documents, paragraphs in PDF documents, slides in presentation documents, etc. Each chunk is composed of multiple sentences that are either atomic or complex in structure, which means a document is also a collection of atomic and/or complex sentences. An atomic sentence (see Figure 2) is a sentence that contains only one subject (S), one object (O) and one verb (V) between the subject and the object. For every atomic sentence whose object is also a noun phrase, there are at least two relation triplets (e.g., “The tiger that gave the excellent speech is handsome” has relation triplets: (tiger, give, speech), (speech, be given by, tiger), and (tiger, be, handsome)). By contrast, a complex sentence can be subdivided into multiple atomic sentences. Given that the syntactic verb in a relation triplet is determined by the subject and the object, a document m in a corpus can ultimately be reduced to N_m subject phrases (we convert objects to subjects using the passive voice) associated with relation triplets. N_m is usually larger than the actual number of noun phrases in document m. By replacing the unigrams in LDA with relation triplets, we retain definitive information and assign salient noun phrases high weights.
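The triplet representation and the passive-voice conversion above can be illustrated with a minimal sketch; the `Triplet` type and the `invert` helper are hypothetical names introduced for illustration (a real system would obtain the past participle from the parser):

```python
from collections import namedtuple

# A relation triplet as used throughout the paper: subject, verb, object.
Triplet = namedtuple("Triplet", ["subject", "verb", "object"])

def invert(t, past_participle):
    """Turn a triplet around so its object can also serve as a subject
    phrase, mirroring the paper's passive-voice conversion, e.g.
    (tiger, give, speech) -> (speech, be given by, tiger)."""
    return Triplet(t.object, f"be {past_participle} by", t.subject)

def subject_phrases(triplets):
    """Collect the distinct subject phrases a document reduces to."""
    return sorted({t.subject for t in triplets})
```

Applied to the tiger example, `invert` recovers the (speech, be given by, tiger) triplet from (tiger, give, speech).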

We define Dir(α) as a Dirichlet distribution parameterized by hyperparameter α, Mult(θ) as a multinomial distribution parameterized by θ, Dir(β) as a Dirichlet distribution parameterized by β, and Mult(φ) as a multinomial distribution parameterized by φ. We assume the corpus has K topics. Assigning topics to the relation triplets of document m follows a multinomial distribution Mult(θ_m) with prior Dir(α). Selecting the relation triplets for document m given the topics follows a multinomial distribution Mult(φ) with prior Dir(β). We denote r as the list of relation triplet lists extracted from the M documents in the corpus, and z as the list of topic assignments of r. We denote the relation triplet counts of the documents in the corpus by N_1, …, N_M. The graphical representation of the relation-based latent Dirichlet allocation (rLDA) model is illustrated in Figure 2.

Fig. 2: Plate notation of rLDA

The plate notation can be decomposed into two types of Dirichlet-multinomial conjugated structures: the document-topic distribution and the topic-relation distribution. Hence, the joint distribution of r and z can be represented as

p(r, z | α, β) = ∏_{m=1}^{M} [Δ(n_m + α) / Δ(α)] · ∏_{k=1}^{K} [Δ(n_k + β) / Δ(β)],   (1)

where Δ(·) denotes the multivariate Beta function, V is the number of unique relations in all documents, n_k = (n_k^{(1)}, …, n_k^{(V)}) with n_k^{(r)} the number of occurrences of the relation triplet r generated by topic k in all documents, and n_m = (n_m^{(1)}, …, n_m^{(K)}) with n_m^{(k)} the number of relation triplets generated by topic k in document m.

Dir(α) is a conjugate prior for Mult(θ), and thus the posterior distribution is a new Dirichlet distribution parameterized by n_m + α. The same rule applies to Dir(β) and Mult(φ).
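The conjugate update can be checked numerically: observing topic counts under a symmetric Dir(α) prior yields a posterior Dirichlet whose mean is a smoothed version of the empirical proportions. A minimal sketch (function name ours):

```python
def dirichlet_posterior_mean(topic_counts, alpha):
    """Posterior mean of theta under a symmetric Dir(alpha) prior after
    observing multinomial topic counts.

    Because the Dirichlet is conjugate to the multinomial, observing
    counts n simply updates the prior Dir(alpha) to Dir(n + alpha), whose
    mean is (n_k + alpha) / (sum_j n_j + K * alpha).
    """
    total = sum(topic_counts) + alpha * len(topic_counts)
    return [(c + alpha) / total for c in topic_counts]
```

For counts [3, 1, 0] and α = 1, the posterior mean is [4/7, 2/7, 1/7]: the empirical proportions pulled toward uniform by the prior.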

III-B Relation Triplet Extraction

Extracting relation triplets is the essential step of hrLDA, and it is also the key process for converting a hierarchical topic tree to an ontology structure. The idea is to find all syntactically related noun phrases and their connections using language parsers such as the Stanford NLP parser [25] and Ollie [26]. Generally, there are two types of relation triplets:

  • Subject-predicate-object-based relations,
    e.g., New York is the largest city in the United States (New York, be the largest city in, the United States);

  • Noun-based/hidden relations,
    e.g., Queen Elizabeth (Elizabeth, be, queen).

A special type of relation triplets can be extracted from presentation documents such as those written in PowerPoint using document structures. Normally lines in a slide are not complete sentences, which means language parsing does not work. However, indentations and bullet types usually express inclusion relationships between adjacent lines. Starting with the first line in an itemized section, our algorithm scans the content in a slide line by line, and creates relations based on the current item and the item that is one level higher.
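The line-by-line scan can be sketched with a stack of open ancestors; the relation label "include" is an assumption for illustration, as the paper does not specify the verb it attaches to these structural relations:

```python
def slide_relations(items, relation="include"):
    """Derive parent-child relation triplets from an itemized slide.

    items: list of (indent_level, text) pairs in top-to-bottom order.
    Each item is related to the most recent item whose indentation is one
    level (or more) higher, tracked with a stack of open ancestors.
    """
    stack = []    # (level, text) of ancestors still open
    triples = []
    for level, text in items:
        while stack and stack[-1][0] >= level:
            stack.pop()               # close siblings and deeper branches
        if stack:                     # the stack top is one level higher
            triples.append((stack[-1][1], relation, text))
        stack.append((level, text))
    return triples
```

For a slide with a top-level item "packaging" and indented items beneath it, the function emits one triplet per parent-child pair, in reading order.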

III-C Acquaintance Chinese Restaurant Process

As mentioned in Section 2, CRP always assigns the highest probability to the largest table, which assumes customers are more likely to sit at the table that has the largest number of customers. This ignores the social reality that a person is more willing to choose the table where his/her closest friends are sitting, even though that table also seats unknown people who are actually friends of friends. Similarly, in human-written documents, adjacent sentences usually describe the same topics. We consider a restaurant table as a topic, and a person sitting at any of the tables as a noun phrase. In order to penalize the largest topic and assign high probabilities to adjacent noun phrases being in the same topics, we introduce an improved partition method, the Acquaintance Chinese Restaurant Process (ACRP).

The ultimate purposes of ACRP are to estimate K, the number of topics for rLDA, and to set the initial topic distribution states for rLDA. Suppose a document is read from top to bottom and left to right. As each noun phrase belongs to one sentence and one text chunk (e.g., section, paragraph or slide), the locations of all noun phrases in a document can be mapped to a two-dimensional space where sentence location is the x axis and text chunk location is the y axis (the first noun phrase of a document holds value (0, 0)). More specifically, every noun phrase has four attributes: content, location, one-to-many relation triplets, and document ID. Noun phrases in the same text chunk are more likely to be “acquaintances”; they are even closer to each other if they are in the same sentence. In contrast to CRP, ACRP assigns probabilities based on closeness, as specified in the following procedure.

  1. Let z_i be the integer-valued random variable corresponding to the index of the topic assigned to the i-th noun phrase. Draw a probability from Equations 2 to 5 below for the i-th noun phrase joining each of the k existing topics and a new topic, given the topic assignments of the previous i - 1 noun phrases, z_{1:i-1}.

    • The probability of choosing the (k+1)-th (i.e., a new) topic is given by Equation 2.

    • The probability of selecting any of the k existing topics falls into three cases:

      • if the content of the i-th noun phrase is synonymous with or an acronym of a previously analyzed noun phrase in the topic, Equation 3 applies;

      • else if the document ID of the i-th noun phrase differs from all document IDs belonging to the topic, Equation 4 applies;

      • otherwise, Equation 5 applies, where n_z refers to the current number of noun phrases in the topic, d_c represents the vector of chunk location differences between the i-th noun phrase and all members in the topic, d_s stands for the vector of sentence location differences, and λ is a penalty factor.

    Normalize the k + 1 probabilities to guarantee they are each in the range of [0, 1] and their sum is equal to 1.

  2. Based on the probabilities given by Equations 2 to 5, we sample a topic index for every noun phrase, and we count the number of unique topics, K, in the end. We shuffle the order of documents and iterate ACRP until K is unchanged.
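Since Equations 2 to 5 are not reproduced here, the closeness idea behind ACRP can only be sketched; the exponential distance penalty and the parameter names `gamma` and `lam` below are assumed stand-ins for the paper's exact formulas, not the formulas themselves:

```python
import math

def acrp_weights(phrase, topics, gamma=1.0, lam=0.5):
    """Illustrative ACRP-style probabilities for one noun phrase.

    phrase: (doc_id, chunk_loc, sentence_loc); topics: list of topics,
    each a list of such tuples. A topic with no member from the phrase's
    document gets weight 0; otherwise the weight grows with the number of
    nearby members, decaying exponentially in chunk and sentence distance
    (the assumed penalty). The last entry is the new-topic weight.
    """
    doc, chunk, sent = phrase
    weights = []
    for members in topics:
        if all(m[0] != doc for m in members):
            weights.append(0.0)  # no shared document: not acquaintances
            continue
        closeness = sum(
            math.exp(-lam * (abs(chunk - c) + abs(sent - s)))
            for d, c, s in members if d == doc
        )
        weights.append(closeness)
    weights.append(gamma)  # weight of opening a new topic
    total = sum(weights)
    return [w / total for w in weights]  # normalized, as in step 1
```

Unlike plain CRP, a large topic from a different document receives zero probability, while a small topic of nearby phrases is favored.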

III-D Nested Acquaintance Chinese Restaurant Process

The procedure for extending ACRP to hierarchies is essential to why hrLDA outperforms hLDA. Instead of a predefined tree depth, the tree depth of hrLDA is optional and data-driven. More importantly, clustering decisions are made given a global distribution of all current non-partitioned phrases (leaves) in our algorithm. This means there can be multiple paths traversed down a topic tree for each document. With reference to the topic tree, every node has a noun phrase as its label and represents a topic that may have multiple sub-topics. The root node is visited by all phrases. In practice, we do not link any phrases to the root node, as it contains the entire vocabulary. An inner node of a topic tree contains a selected topic label. A leaf node contains an unprocessed noun phrase. We define a hashmap H with a document ID as the key and the current leaf nodes of the document as the value. We denote the current tree level by l. We next outline the overall algorithm.

  1. We start with the root node (level l = 0) and apply rLDA to all the documents in a corpus.

    1. Collect the current leaf nodes of every document. H initially contains all noun phrases in the corpus. Assign a cluster partition to the leaf nodes in each document based on ACRP and sample the cluster partition until the number of topics of all noun phrases in H is stable or the iteration reaches the predefined maximum number of iterations (whichever occurs first).

    2. Mark the number of topics (child nodes) of the parent node at level l as K_l. Build a K_l-dimensional topic proportion vector θ based on Dir(α).

    3. For every noun phrase in document m, form the topic assignments z based on Mult(θ).

    4. Generate relation triplets from Mult(φ) given z and the associated topic vector θ.

    5. Eliminate partitioned leaf nodes from H. Update the current level by l ← l + 1.

  2. If phrases in H are not yet completely partitioned to the next level and l is less than the maximum depth, continue the following steps. For each leaf node, we set the top phrase (i.e., the phrase having the highest probability) as the topic label of this leaf node, and the leaf node becomes an inner node. We next update H and repeat procedures 1a to 1e.

To summarize this process more succinctly: we build the topic hierarchies with rLDA in a divisive way (see Figure 3). We start with the collection of extracted noun phrases and split them using rLDA and ACRP. Then, we apply the procedure recursively until each noun phrase is selected as a topic label. After every rLDA assignment, each inner node contains only the topic label (top phrase), and the rest of the phrases are divided into nodes at the next level using ACRP and rLDA. Hence, we build a topic tree in which each node is a topic label (noun phrase), and each topic is composed of its topic label and the topic labels of its descendants. In the end, we finalize our terminological ontology by linking the extracted relation triplets to the topic labels as subjects.

Fig. 3: Graphical representation of hrLDA
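The divisive construction can be sketched schematically; here `partition` stands in for the ACRP + rLDA clustering step and `label_of` for the top-phrase selection, both of which are far richer in the actual model:

```python
def build_tree(phrases, partition, label_of, max_depth, depth=0):
    """Divisive topic-tree construction (schematic).

    partition(phrases) -> list of phrase groups (stand-in for ACRP + rLDA);
    label_of(group)    -> the top phrase chosen as the group's topic label.
    Returns a nested dict mapping each topic label to its subtree; the
    remaining phrases of each group are split recursively at the next level.
    """
    if not phrases or depth >= max_depth:
        return {}
    tree = {}
    for group in partition(phrases):
        label = label_of(group)
        rest = [p for p in group if p != label]
        tree[label] = build_tree(rest, partition, label_of, max_depth, depth + 1)
    return tree
```

With a toy partition that groups phrases by first letter and picks the shortest phrase as label, two city branches emerge, each with its landmark as a child.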

We use collapsed Gibbs sampling [27] for inference from the posterior distribution based on Equation 1. Assume the i-th noun phrase in parent node p comes from document m. We denote the unassigned noun phrases from document m in parent node p by r_m^p, and the number of unique noun phrases in parent node p by V_p. We simplify the probability of assigning the i-th noun phrase r_i in parent node p to topic k among the K_l topics as

p(z_i = k | z_{-i}, r) ∝ (n_{m,k}^{-i} + α) · (n_{k,r_i}^{-i} + β) / (Σ_r n_{k,r}^{-i} + V_p β),   (6)

where z_{-i} refers to all topic assignments other than z_i, n_{k,r_i}^{-i} is the number of occurrences of noun phrase r_i in topic k excluding the i-th noun phrase in p, and n_{m,k}^{-i} stands for the number of times that topic k occurs in r_m^p excluding the i-th noun phrase in p. Here θ_m is the multinomial document-topic distribution for the unassigned noun phrases r_m^p, and φ_k is the multinomial topic-relation distribution for topic k. The time and space complexity of hrLDA grow with K_l, the number of topics at level l.
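A single collapsed Gibbs update of this form can be sketched as follows (a standard LDA-style count-based update restricted to one parent node; all variable names are ours):

```python
import random

def gibbs_step(i, doc, phrase, z, ndk, nkw, nk, alpha, beta, V, rng):
    """One collapsed Gibbs update for the topic of phrase i.

    z: current topic assignments; ndk[d][k]: topic counts per document;
    nkw[k][w]: phrase counts per topic; nk[k]: total count per topic;
    V: number of unique phrases. The phrase is removed from the counts,
    a new topic is sampled proportionally to
    (ndk + alpha) * (nkw + beta) / (nk + V * beta), and the counts are
    restored with the new assignment.
    """
    k_old = z[i]
    ndk[doc][k_old] -= 1; nkw[k_old][phrase] -= 1; nk[k_old] -= 1
    K = len(nk)
    weights = [
        (ndk[doc][k] + alpha) * (nkw[k][phrase] + beta) / (nk[k] + V * beta)
        for k in range(K)
    ]
    k_new = rng.choices(range(K), weights=weights)[0]
    z[i] = k_new
    ndk[doc][k_new] += 1; nkw[k_new][phrase] += 1; nk[k_new] += 1
    return k_new
```

Note the update leaves all count tables consistent, so it can be iterated over every phrase until the assignments stabilize.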

In order to build a hierarchical topic tree of a specific domain, we must generate a subset of the relation triplets using external constraints or semantic seeds via a pruning process [28]. As mentioned above, in a relation triplet, each relation connects one subject and one object. By assembling all subject and object pairs, we can build an undirected graph with the objects and the subjects constituting the nodes of the graph [29]. Given one or multiple semantic seeds as input, we first collect a set of nodes that are connected to the seed(s), and then take the relations from the set of nodes as input to retrieve associated subject and object pairs. This process constitutes one recursive step. The subject and object pairs become the input of the subsequent recursive step.
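The recursive seed expansion can be sketched as a bounded traversal of the undirected subject-object graph (function name and the `hops` bound are ours):

```python
def prune_by_seeds(triples, seeds, hops=2):
    """Keep only triplets whose subject and object are reachable from the
    seed terms within `hops` recursive steps over the undirected
    subject-object graph.
    """
    adj = {}
    for s, _, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    keep, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        # each recursive step expands to the neighbors of the last frontier
        frontier = {n for f in frontier for n in adj.get(f, ())} - keep
        keep |= frontier
    return [t for t in triples if t[0] in keep and t[2] in keep]
```

Starting from the seed "capital", triplets about capitals and their connected terms survive, while unrelated branches (e.g., birds) are pruned away.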

IV Empirical Results

IV-A Implementation

We utilized the Apache POI library to parse texts from PDF, Word and presentation files; the MALLET toolbox [30] for the implementations of LDA, optimized_LDA [31] and hLDA; the Apache Jena library to add relations, properties and members to hierarchical topic trees; and Stanford Protégé (http://protege.stanford.edu/) for illustrating extracted ontologies. We make our code and data available at https://github.com/XiaofengZhu/hrLDA. We used the same empirical hyperparameter setting across all our experiments. We then demonstrate the evaluation results from two aspects: topic hierarchies and ontology rules.

IV-B Hierarchy Evaluation

In this section, we present the evaluation results of hrLDA tested against optimized_LDA, hLDA, and phrase_hLDA (i.e., hLDA based on noun phrases), as well as ontology examples that hrLDA extracted from real-world text data. The entire corpus we generated contains 349,362 tokens (after removing stop words and cleaning). It includes 84 presentation files, articles from 1,782 Wikipedia pages, and 3,000 research papers that were published in IEEE manufacturing conference proceedings within the last decade. In order to see the performance on data sets of different scales, we also used a smaller corpus, Wiki, that holds the articles collected from the Wikipedia pages only.

We extract a single-level topic tree using each of the four models; hrLDA becomes rLDA, and phrase_hLDA becomes phrase-based LDA. We have tested the average perplexity and running time performance of ten independent runs on each of the four models [32, 33]. Equation 7 defines the perplexity, which we employed as an empirical measure:

perplexity(r) = exp( - Σ_{m=1}^{M} log p(r_m | z_m) / Σ_{m=1}^{M} N_m ),   (7)

where r_m is a vector containing the relation triplets in document m, and z_m is the topic assignment for r_m.
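Equation 7 can be computed directly from per-document log-likelihoods (a sketch; the log-likelihood computation itself is model-specific and omitted here):

```python
import math

def perplexity(log_likelihoods, token_counts):
    """Corpus perplexity from per-document log-likelihoods log p(r_m | z_m)
    and triplet counts N_m, as in Equation 7: exp(-sum log p / sum N).
    Lower values indicate a better fit.
    """
    return math.exp(-sum(log_likelihoods) / sum(token_counts))
```

As a sanity check, a document of 4 triplets each assigned probability 0.25 yields perplexity 4: the model is exactly as uncertain as a uniform choice over 4 options.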

The comparison results on our Wiki corpus are shown in Figure 4. hrLDA yields the lowest perplexity and reasonable running time. As the running time spent on parameter optimization is extremely long (the optimized_LDA requires 19.90 hours to complete one run), for efficiency, we adhere to the fixed parameter settings for hrLDA.

Fig. 4: Comparison results of hrLDA, phrase_hLDA, hLDA and optimized_LDA on perplexity and running time


Figures 5 to 7 illustrate the perplexity trends of the three hierarchical topic models (i.e., hrLDA, phrase_hLDA and hLDA) applied to both the Wiki corpus and the entire corpus under different level settings. From left to right, hrLDA retains the lowest perplexity as the corpus size grows. Furthermore, from top to bottom, hrLDA remains stable as the topic level increases, whereas the perplexity of phrase_hLDA, and especially that of hLDA, grows rapidly. Figure 8 highlights the perplexity values of the three models with confidence intervals in the final state. As shown in the two types of experiments, hrLDA has the lowest average perplexity and the smallest confidence intervals, followed by phrase_hLDA, and then hLDA.

(a) The Wiki corpus
(b) The entire corpus
Fig. 5: Perplexity trends within 2000 iterations with level = 2
(a) The Wiki corpus
(b) The entire corpus
Fig. 6: Perplexity trends within 2000 iterations with level = 6
(a) The Wiki corpus
(b) The entire corpus
Fig. 7: Perplexity trends within 2000 iterations with level = 10
(a) The Wiki corpus
(b) The entire corpus
Fig. 8: Average perplexities with confidence intervals of the three models in the final 2000th iteration with level = 10

Our interpretation is that hLDA and phrase_hLDA tend to assign terms to the largest topic and thus do not guarantee that each topic path contains terms with similar meaning.


Figure 9 shows exhaustive hierarchical topic trees extracted from a small text sample with topics from four domains: semiconductor, integrated circuit, Berlin, and London. hLDA tends to mix words from different domains into one topic. For instance, words on the first level of its topic tree come from all four domains. This is because the topic path drawing method in existing hLDA-based models takes words in the most important topic of every document and labels them as the main topic of the corpus. In contrast, hrLDA is able to create four big branches for the four domains from the root. Hence, it generates clean topic hierarchies from the corpus.

(a) A toy corpus in domains: semiconductor, integrated circuit, Berlin, and London
(b) The topic tree obtained from hLDA; each node contains the top five words ordered by their probabilities of being in the corresponding topics
(c) The topic tree (left panel class hierarchy) with relations (right panel class annotations) obtained from hrLDA
Fig. 9: Performance of hLDA and hrLDA on a toy corpus of diversified topics

IV-C Gold Standard-based Ontology Evaluation

The visualization of one concrete ontology on the semiconductor domain is presented in Figure 10. For instance, topic packaging contains topic integrated circuit packaging, and topic label jedec is associated with the relation triplet (jedec, be short for, joint electron device engineering council).

Fig. 10: A 10-level semiconductor ontology that contains 2063 topics and 6084 relation triplets

We use KB-LDA, phrase_hLDA, and LDA+GSHL as our baseline methods, and compare ontologies extracted by hrLDA, KB-LDA, phrase_hLDA, and LDA+GSHL with DBpedia ontologies. We use precision, recall and F-measure for this ontology evaluation. A true positive case is an ontology rule that can be found both in an extracted ontology and in the associated DBpedia ontology. A false positive case is an incorrectly identified ontology rule. A false negative case is a missed ontology rule. Table I shows the evaluation results of ontologies extracted from Wikipedia articles pertaining to European Capital Cities (Corpus E), Office Buildings in Chicago (Corpus O) and Birds of the United States (Corpus B) using hrLDA, KB-LDA, phrase_hLDA (tree depth = 3), and LDA+GSHL, in contrast to the gold ontologies belonging to DBpedia. The three corpora used in this evaluation were collected from Wikipedia abstracts, the same text source as DBpedia. The seeds of hrLDA and the root concepts of LDA+GSHL are capital, building, and bird. For both KB-LDA and phrase_hLDA we kept the top five tokens in each topic, as each node of their topic trees is a distribution/list of phrases. hrLDA achieves the highest precision and F-measure scores in all three experiments. KB-LDA performs better than phrase_hLDA and LDA+GSHL, and phrase_hLDA performs similarly to LDA+GSHL. In general, hrLDA works well especially when the pre-knowledge already exists inside the corpora. Consider the following two statements taken from the corpus on Birds of the United States as an example. In order to use the two short documents “The Acadian flycatcher is a small insect-eating bird.” and “The Pacific loon is a medium-sized member of the loon.” to infer that the Acadian flycatcher and the Pacific loon are both related to topic bird, the pre-knowledge that “the loon is a species of bird” is required by hrLDA. This example explains why the accuracy of extracting ontologies from this kind of corpus is low.
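The scoring against DBpedia can be sketched as a set comparison of ontology rules (treating each rule as an exact triplet is our assumption about the matching criterion; function name ours):

```python
def prf(extracted, gold):
    """Precision, recall and F-measure of extracted ontology rules
    against a gold rule set (here, DBpedia). Rules are compared as
    exact matches, so both arguments are sets of hashable rules.
    """
    tp = len(extracted & gold)          # rules found in both ontologies
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f
```

F-measure is the harmonic mean of precision and recall, which is why hrLDA's balanced scores in Table I translate into the highest F-measure column-wise.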

                Corpus E   Corpus O   Corpus B
Precision
  hrLDA             96.0       92.4       84.0
  KB-LDA            90.7       89.9       79.4
  phrase_hLDA       27.6       27.4       24.5
  LDA+GSHL          52.4       19.8       28.6
Recall
  hrLDA             86.9       74.7       81.9
  KB-LDA            83.8       75.4       63.3
  phrase_hLDA       50.6       57.5       36.5
  LDA+GSHL          20.0       73.1       11.8
F-measure
  hrLDA             91.2       82.6       82.9
  KB-LDA            87.1       82.0       70.4
  phrase_hLDA       35.7       26.8       29.3
  LDA+GSHL          29.0       31.2       16.7
TABLE I: Precision, recall and F-measure (%)

V Concluding Remarks

In this paper, we have proposed a completely unsupervised model, hrLDA, for terminological ontology learning. hrLDA is a domain-independent and self-learning model, which means it is very promising for learning ontologies in new domains and thus can save significant time and effort in ontology acquisition.

We have compared hrLDA with popular topic models to interpret how our algorithm learns meaningful hierarchies. By taking syntax and document structures into consideration, hrLDA is able to extract more descriptive topics. In addition, hrLDA eliminates the restrictions on the fixed topic tree depth and the limited number of topic paths. Furthermore, ACRP allows hrLDA to create more reasonable topics and to converge faster in Gibbs sampling.

We have also compared hrLDA to several unsupervised ontology learning models and shown that hrLDA can learn applicable terminological ontologies from real-world data. Although hrLDA cannot be applied directly to formal reasoning, it is efficient for building knowledge bases for information retrieval and simple question answering. However, hrLDA is sensitive to the quality of the extracted relation triplets. In order to give optimal answers, hrLDA should be embedded in more complex probabilistic modules that identify true facts from extracted ontology rules. Finally, one issue we have not addressed in our current study is capturing pre-knowledge. Although a direct solution would be adding the missing information to the data set, a more advanced approach would be to train topic embeddings to extract hidden semantics.


This work was supported in part by Intel Corporation and the Semiconductor Research Corporation (SRC). We are grateful to Professor Goce Trajcevski of Northwestern University for his insightful suggestions and discussions. This work was partly conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health.


References

  • [1] G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
  • [2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, “DBpedia: A nucleus for a web of open data,” in Proceedings of the 6th International Semantic Web Conference.   Springer, 2007, pp. 722–735.
  • [3] F. M. Suchanek, G. Kasneci, and G. Weikum, “YAGO: A core of semantic knowledge,” in Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 697–706.
  • [4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: a collaboratively created graph database for structuring human knowledge,” in Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008, pp. 1247–1250.
  • [5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M. Mitchell, “Toward an architecture for never-ending language learning,” in AAAI, 2010.
  • [6] F. Niu, C. Zhang, C. Ré, and J. W. Shavlik, “DeepDive: Web-scale knowledge-base construction using statistical learning and inference,” VLDS, vol. 12, pp. 25–28, 2012.
  • [7] S. Mukherjee, J. Ajmera, and S. Joshi, “Domain cartridge: Unsupervised framework for shallow domain ontology construction from corpus,” in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management.   ACM, 2014, pp. 929–938.
  • [8] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, “Knowledge vault: A web-scale approach to probabilistic knowledge fusion,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2014, pp. 601–610.
  • [9] Z. Wei, J. Zhao, K. Liu, Z. Qi, Z. Sun, and G. Tian, “Large-scale knowledge base completion: Inferring via grounding network sampling over selected instances,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ser. CIKM ’15.   ACM, 2015, pp. 1331–1340.
  • [10] S. P. Chatzis, “Inducing space dirichlet process mixture large-margin entity relationship inference in knowledge bases,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management.   ACM, 2015, pp. 1311–1320.
  • [11] D. Q. Nguyen, K. Sirts, L. Qu, and M. Johnson, “Neighborhood mixture model for knowledge base completion,” arXiv preprint arXiv:1606.06461, 2016.
  • [12] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
  • [13] I. Ocampo-Guzman, I. Lopez-Arevalo, and V. Sosa-Sosa, “Data-driven approach for ontology learning,” in Electrical Engineering, Computing Science and Automatic Control, CCE, 2009 6th International Conference on.   IEEE, 2009, pp. 1–6.
  • [14] W. Wei, P. Barnaghi, and A. Bargiela, “Probabilistic topic models for learning terminological ontologies,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 7, pp. 1028–1040, 2010.
  • [15] Y. Jing, W. Junli, and Z. Xiaodong, “An ontology term extracting method based on latent dirichlet allocation,” in Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on.   IEEE, 2012, pp. 366–369.
  • [16] A. Slutsky, X. Hu, and Y. An, “Tree labeled lda: A hierarchical model for web summaries,” in Big Data, 2013 IEEE International Conference on.   IEEE, 2013, pp. 134–140.
  • [17] F. Colace, M. De Santo, L. Greco, F. Amato, V. Moscato, and A. Picariello, “Terminological ontology learning and population using latent dirichlet allocation,” Journal of Visual Languages & Computing, vol. 25, no. 6, pp. 818–826, 2014.
  • [18] D. Movshovitz-Attias and W. W. Cohen, “KB-LDA: Jointly learning a knowledge base of hierarchy, relations, and facts,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015, pp. 1449–1459.
  • [19] Z. Hu, G. Luo, M. Sachan, E. Xing, and Z. Nie, “Grounding topic models with knowledge bases,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2016.
  • [20] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, “Hierarchical topic models and the nested chinese restaurant process,” in Advances in Neural Information Processing Systems 16.   MIT Press, 2004, p. 17.
  • [21] D. M. Blei, T. L. Griffiths, and M. I. Jordan, “The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies,” Journal of the ACM (JACM), vol. 57, no. 2, p. 7, 2010.
  • [22] D. M. Blei and P. I. Frazier, “Distance dependent chinese restaurant processes,” The Journal of Machine Learning Research, vol. 12, pp. 2461–2488, 2011.
  • [23] A. Ahmed, L. Hong, and A. J. Smola, “Nested chinese restaurant franchise process: Applications to user tracking and document modeling,” in ICML (3), 2013, pp. 1426–1434.
  • [24] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan, “Nested hierarchical dirichlet processes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 256–270, 2015.
  • [25] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard., and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60.
  • [26] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni, “Open language learning for information extraction,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 524–534.
  • [27] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, vol. 101, no. suppl 1, pp. 5228–5235, 2004.
  • [28] M. Thelen and E. Riloff, “A bootstrapping method for learning semantic lexicons using extraction pattern contexts,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10.   Association for Computational Linguistics, 2002, pp. 214–221.
  • [29] S. Krause, H. Li, H. Uszkoreit, and F. Xu, “Large-scale learning of relation-extraction rules with distant supervision from the web,” The Semantic Web–ISWC 2012, pp. 263–278, 2012.
  • [30] A. K. McCallum, “MALLET: A machine learning for language toolkit,” 2002, http://mallet.cs.umass.edu.
  • [31] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, “On smoothing and inference for topic models,” Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 27–34, 2009.
  • [32] A. Gangopadhyay, M. Molek, Y. Yesha, M. Brady, and Y. Yesha, “A methodology for ontology evaluation using topic models,” in Intelligent Networking and Collaborative Systems (INCoS), 2012 4th International Conference on.   IEEE, 2012, pp. 390–395.
  • [33] D. Downey, C. S. Bhagavatula, and A. Yates, “Using natural language to integrate, evaluate, and optimize extracted knowledge bases,” in Proceedings of the 2013 workshop on Automated knowledge base construction.   ACM, 2013, pp. 61–66.