Document Structure Measure for Hypernym discovery

11/30/2018 ∙ by Aswin Kannan, et al.

Hypernym discovery is the problem of finding terms that have an is-a relationship with a given term. We introduce a new context type and a relatedness measure to differentiate hypernyms from other types of semantic relationships. Our Document Structure measure is based on the hierarchical position of terms in a document and on their presence or absence in definition text. The measure quantifies document structure using multiple attributes and classes of weighted distance functions.




1 Introduction

Annotating data and deciphering relationships is a key task in text analytics. Hypernym discovery addresses one major class of such relationships: it deduces the “is-a” relationship between any two terms in a document corpus, and therefore feeds into several other Natural Language Processing applications, including entailment and co-reference resolution.

A number of supervised methods have been proposed for this problem [5, 3], and their strong performance has been clearly observed. One major drawback, however, is that they require training data in a specific form and in large volume, as do all learning-based methods. In the music domain [2], for example, the document corpora underlying the training data change frequently, which renders supervision impractical. Similar examples can be found in regulatory compliance, where guidelines and circulars are frequently updated by bureaucrats.

In these settings, some unsupervised methods [7] have demonstrated promising performance in select domains. One recent work [1], in the context of land surveying, addressed these issues in a purely rule-based framework that uses word similarities and co-occurrence frequency to decipher the hierarchical relationship between any two terms. [7] surveyed a number of relatedness measures on multiple datasets and proposed ways to choose the measures based on the dataset. More specifically, measures such as similarity, informativeness, inclusion, and reverse inclusion were studied, and a combination of several such measures demonstrated better performance. [9] also proposed similar distributional inclusion vectors (combining multiple measures) and observed similar results.

Our framework follows the lines of [7], but focuses mainly on modeling document structure in the form of mathematical measures. In recent work [13], there has been an exploration of hierarchical structures for hypernym discovery, but it assumes the presence of Wikipedia entries for many of the candidate terms. Our document structure measures, on the other hand, are very general in that they encompass (but are not restricted to) section titles, bulleted lists, and highlighted text. We also extend this concept of document structure to the binary classification of personal data; metadata headers in JSON files are one example.

Our contributions are threefold. First, we mathematically formulate indicator-based context vectors to quantify our structural measures. Second, based on the context vectors, we develop mathematical functions to relate word pairs and classify personal data. Third, since it is not straightforward to manually decipher and annotate documents with respect to their structure, we make this automation feasible by deploying SystemT-based rules and methods. Some primary works on such rule-based information extraction schemes are presented in [10, 11]. Recent work [12] has specifically dealt with extracting titles, section and subsection headings, paragraphs, and sentences from large documents. Additionally, [8] extracts structure and concepts from HTML documents in compliance-specific applications. We build on [8, 12] to automate our process of document structure discovery and annotation.

The rest of the paper is organized into four sections. Section 2 expands on the basics of hypernym discovery and inclusion measures. Section 3 discusses our document structure measures in detail. In Section 4, we test the performance of our algorithms on the older Wikipedia data sets and compare them with results from the literature. In the process, we also introduce a newer data set (the ENRON email corpus) and observe the performance of our algorithms on it. Initial numerical results are promising and pave the way for further analysis of such context vectors with machine learning and deep learning approaches. We conclude in Section 5.


2 Background

For completeness, we restate the measures relevant to our work from past literature [7]. To quantify the closeness of any two words, x and y, we restate the definition of the inclusion measure as follows.


The above is extremely general in that the square root can be replaced by other suitable convex functions, which may be nonlinear or even nonsmooth. The significance of the function is that it mimics general distance functions such as the Euclidean norm. CDE refers to the ClarkeDE measure, which is restated as follows.

where x and y respectively denote the words in the document and C(.) refers to the context in which these words occur. Here, we note that CDE behaves like a probability, with an upper limit of 1. In short, the numerator denotes the number of times the words x and y appear together in a context, and the denominator denotes the total number of instances in which at least one of x and y appears independently. The context is crucial: it helps in ascertaining whether the words x and y appear together because of a hypernymy relationship or because of coincidence.
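
As a rough illustration of these two measures, the following Python sketch uses the definitions of ClarkeDE and invCL as they are commonly given in the literature surveyed in [7]. Representing a context vector as a dictionary from context feature to nonnegative weight, and the toy vectors `cat` and `animal`, are assumptions made only for this example.

```python
import math


def clarke_de(x_ctx, y_ctx):
    """ClarkeDE(x -> y): weighted share of x's contexts that y also has.

    Numerator: sum of min(w_x(c), w_y(c)) over contexts shared by x and y.
    Denominator: sum of w_x(c) over all contexts of x.
    """
    total = sum(x_ctx.values())
    if total == 0:
        return 0.0
    shared = sum(min(w, y_ctx[c]) for c, w in x_ctx.items() if c in y_ctx)
    return shared / total


def inv_cl(x_ctx, y_ctx):
    """invCL(x, y) = sqrt(CDE(x, y) * (1 - CDE(y, x)))."""
    return math.sqrt(clarke_de(x_ctx, y_ctx) * (1.0 - clarke_de(y_ctx, x_ctx)))


# Toy context vectors (hypothetical weights, not taken from any dataset).
cat = {"pet": 2.0, "fur": 1.0}
animal = {"pet": 3.0, "fur": 2.0, "wild": 4.0, "zoo": 1.0}

print(inv_cl(cat, animal))   # high: cat's contexts are included in animal's
print(inv_cl(animal, cat))   # low: the reverse inclusion does not hold
```

The asymmetry of the score is what makes it usable for direction: a narrow term whose contexts are covered by a broader term scores high in one direction only.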

Getting into finer details, C(x) quantifies the context vector of the word x (also termed its embedding). This context can refer to the window of operation, features representing parts of speech (POS), character embeddings, and so on. Say the window of operation has size 4; then the context consists of all words that appear within a distance of 4 to the left and right of the word x. The window can cover words of different lengths, and it can also be generalized to individual characters, depending on the requirement. Generalizations also exist in which a sliding window is used, the size of which can change from one word to another (depending on the domain and sub-domain). Where repetitions occur (in any context, not only those above), averaging across samples is encouraged. Alternatively, logarithmic probability functions can be deployed.
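
A window-based context of this kind can be sketched in a few lines of Python. The function name `window_contexts`, the use of raw co-occurrence counts as weights, and the sample sentence are assumptions for illustration; a real pipeline would typically apply a feature weighting on top of the counts.

```python
from collections import Counter, defaultdict


def window_contexts(tokens, window=4):
    """Count, for each word, the words within `window` positions on either side."""
    contexts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                contexts[target][tokens[j]] += 1
    return contexts


tokens = "the cat is a small animal that likes fish".split()
ctx = window_contexts(tokens, window=2)
print(ctx["cat"])
```

Repeated occurrences of the same target word simply accumulate in its counter, which is one simple way to handle the repetitions mentioned above.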

From the above concepts of word embeddings, we derive quantifiers for document structure to predict hypernym relationships. While this is discussed in detail in the next section, we set out here the mathematics behind those ideas. In the present section, for notational clarity, we do not disturb the specification of the context vector C(x). We instead add a new vector to specify additional dimensions.

To begin with, let us consider the instance of section titles. Motivated by the fact that hypernyms are found in section titles with a higher probability than hyponyms, we specify functionals that capture such cases. As a simple example, “Country” or “Geographical Location” is a good candidate for a header, whereas “South-East Asia” and “Northern America” are less likely to be headers. On a side note, if “South-East Asia” forms part of a section title, it is more likely for the document title to contain the word “Country”. This can be expressed mathematically: a word appearing in a section title proportionately increases the probability of it being a hypernym. The inverse relationship is defined by an appropriate fraction. Let the context specific to section titles be defined for a word x. Then, the distribution function can be expressed as follows.


Descriptive text contains a significant number of hypernyms. For instance, the word “color” can describe blue or green and be used repeatedly in text to establish certain relationships or define key points in a discussion. In historical text, political party names are often used to describe a scenario. These qualify as hypernyms of their constituent elements, such as leaders, affiliates, and, most importantly, the sub-classes in multiple nations. For example, “the democratic party” may be used to describe some text and can refer to different names in different countries. We note one caveat, however: hyponyms can also be found in descriptive text. Consider the context specific to definition text. To state the higher probability of the former clearly, we come up with the following logarithmic function.


Note that all the weights are nonnegative. At this point, we note that the expressions used above are rudimentary and merely help in understanding how hypernymy measures should be constructed based on document structure. For a deeper theoretical and numerical treatment, we move to the next section.

3 Deploying Structure

Prior to defining our mathematical model, it is important to observe that some words can be hypernyms in general, without being specifically associated with another word. These words form the top portions of hierarchies. For instance, “personal information” can point to names, addresses, biographical details, and so on, and covers a very broad range of labels. While it is true that these cannot be called hypernyms without the presence of the required subsets, we assign a probability value independently to such generalized terms. Next, we define three vectors that capture the contexts in which a word can be a relational hypernym, a relational hyponym, and a general hypernym, respectively. The contexts directly correspond to document structure, and we additionally note that the relational hypernym and relational hyponym vectors have the same contexts, but in opposite senses.

3.1 Relational Context Vectors

We start with an example of bulleted text. Analogous to context windows, we split the entire document into paragraphs, ensuring that each paragraph contains at most one set of bulleted text. Bulleted items (including lists and enumerations) are more likely to be hyponyms. Consider the text below.

“X contains the following:

  • X1

  • X2”

Here, X1 and X2 are hyponyms of X. Given two words x and y, the probability of a hypernym-hyponym relationship between them, specific to a context (in this case a bulleted list), can be stated as follows:

In the above expression, an entity mention refers to the presence of the word in a paragraph that contains a bulleted list. If a paragraph does not contain a bulleted list, the corresponding entries are 0 in both the numerator and the denominator. In some cases a bulleted list may contain more than one occurrence of a word. We treat the expression as an indicator function and do not count multiple occurrences. The relational hypernym and hyponym vectors are specified in opposite senses: if a word occurs in the text preceding the bullets, it contributes to the indicator function of the hypernym vector, and if it occurs within the bullets, it contributes to the indicator function of the hyponym vector. Usually, when one of the two indicators is 1, the other is 0. However, some words can be present in both the preceding text and the bullets, in which case both indicators take the value 1.
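
The indicator logic for a single bulleted paragraph can be sketched as follows. This is only one possible reading: the function name `bullet_indicators`, the “•” bullet marker, and the simple regular-expression tokenizer are assumptions for illustration, not the rules used in the paper's implementation.

```python
import re

WORD = re.compile(r"[a-z0-9-]+")


def bullet_indicators(paragraph_lines, bullet="•"):
    """Return (hypernym_indicator, hyponym_indicator) dicts for one paragraph.

    Words in the text preceding the bullets get a hypernym-side indicator of 1;
    words inside the bullets get a hyponym-side indicator of 1. A paragraph
    without bullets yields empty dicts, i.e. all entries are implicitly 0.
    """
    lead_words, bullet_words = set(), set()
    for line in paragraph_lines:
        stripped = line.strip()
        words = WORD.findall(stripped.lower())
        if stripped.startswith(bullet):
            bullet_words.update(words)
        else:
            lead_words.update(words)
    if not bullet_words:
        return {}, {}
    vocab = lead_words | bullet_words
    hyper = {w: int(w in lead_words) for w in vocab}
    hypo = {w: int(w in bullet_words) for w in vocab}
    return hyper, hypo


paragraph = ["Fruits include the following:", "• apple", "• mango", "• dried fruits"]
hyper, hypo = bullet_indicators(paragraph)
print(hyper["fruits"], hypo["fruits"])  # the word appears on both sides
print(hyper["apple"], hypo["apple"])    # the word appears only inside a bullet
```

Because sets are used, multiple occurrences of a word within the same paragraph collapse to a single indicator, in line with the indicator treatment described above.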

Generalizing the above to all possible contexts, we have the following.

Here, a weight is assigned to each context (pre-set by the user depending on the application), and an importance function relates the occurrences of an entity in the presence of the context to its overall occurrences (presence and absence included). More specifically, for the bulleted-list context, the importance function compares the number of times the word has occurred in text preceding the bullets in the document with the overall number of times the word has appeared in the document. Besides the bulleted list, the other contexts that we consider in this work are the following:

  • Hyperlink / URL content is more probable to contain hypernyms in the first portion and hyponyms in the second portion.

    Eg: …./symptoms/headache. Note that “symptoms” goes into the hypernym bin and “headache” into the hyponym bin.

  • Footnotes are more probable to contain hyponyms.

    Eg: Consider a word that carries a footnote marker. Here, “Word” is the hypernym and the footnote corresponding to the marker can contain hyponyms.

  • Section Headers / Paragraph headers / Subsections follow hierarchical order. Say, when a word x occurs in the section title and word y occurs in the paragraph, x is more probable to be a hypernym of y.

  • Words within brackets are more probable to be hyponyms.

    Eg: Eastern Geographical Location (Say, Japan, Singapore, and Thailand).

  • Subscripts and Superscripts are more probable to be hyponyms.

    Eg: Subscripted and superscripted forms of “Class”, and “Class” itself. Here, “Class” is the hypernym.

  • Words succeeding Indents are very probable to be hypernyms (very similar to section titles). First few words after an indent denote an opening and can contain a hypernym.

    Eg:    A few priorities are required. Say, evening exercises, Yoga, and jogging help maintain fitness.

  • Words defining Under-braces and Over-braces are usually Hypernyms (Say in mathematical descriptors)

    Eg: The following are quite helpful in figuring out the essence of this article.
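
One way to picture the weighted combination over relational contexts is the sketch below. The product form of the score, the ratio used as the importance function, the context names, and the weights are all assumptions chosen for illustration; they are not the paper's exact expression.

```python
def relational_score(weights, hyper_counts_x, hypo_counts_y, total_x, total_y):
    """Weighted sum over contexts of hypernym-side and hyponym-side importance.

    hyper_counts_x[c]: times x occurs on the hypernym side of context c
                       (e.g. in the text preceding a bulleted list).
    hypo_counts_y[c]:  times y occurs on the hyponym side of context c
                       (e.g. inside the bullets).
    total_x, total_y:  overall occurrence counts of x and y in the document.
    """
    if total_x == 0 or total_y == 0:
        return 0.0
    score = 0.0
    for context, weight in weights.items():
        importance_x = hyper_counts_x.get(context, 0) / total_x
        importance_y = hypo_counts_y.get(context, 0) / total_y
        score += weight * importance_x * importance_y
    return score


# Hypothetical nonnegative weights, pre-set per application.
weights = {"bullets": 0.5, "section_title": 0.3, "url": 0.2}
x_hyper = {"bullets": 2, "section_title": 1}
y_hypo = {"bullets": 3}
score = relational_score(weights, x_hyper, y_hypo, total_x=4, total_y=6)
print(score)
```

A context contributes only when x appears on its hypernym side and y on its hyponym side, so contexts in which either word is absent add nothing to the score.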

3.2 General Context Vectors

The expression of the context vector differs slightly in this generalized case. While we still look at pairs, these are specific to occurrences of the hyponym within a given window. In this case, a single vector accounts for both the hypernym and the hyponym. When a word’s context indicates a hypernym, the corresponding entry of the vector is set to 1. When it indicates a hyponym, the entry is set to -1, and it is 0 otherwise. The probability measure takes the following form:

The complete function, including the weights, follows the same construction as earlier. In this work we consider the following contexts for general vectors.

  • Captions of figures and tables are more probable to contain hypernyms.

  • Text with hyphens, semicolons, commas, or quotation marks is more probable to contain hypernyms.

  • A word preceding a question mark, if a noun, is more probable to be a hyponym.

  • Words within Markings / Watermark / Highlighting are more probable to be hyponyms / confidential information.

  • Single-worded cells in Excel-like data are more probable to contain hypernyms.

  • When looking at shapes, the first few boxes would correspond to hypernyms. As an example, this is quite common in MS-Visio.

  • Upper cased words are more probable to be hypernyms.

  • Color / Bold / Italics / Underline are very similar to highlighting and are usually hypernyms.

  • Words after symbols “” are usually hyponyms.

  • Words corresponding to Info-Boxes / Remarks are usually hypernyms (examples from Wikipedia).

  • Text in a larger font size is usually a hypernym.

Some additional contexts, with considerably smaller weights, are also considered and described as follows:

  • A larger number of words in a cell of an Excel file indicates a higher probability of such words being hyponyms.

  • Words in Introduction / Conclusion are likely to be hypernyms.

  • References are more probable to contain hyponyms.

  • Appendix text contains more hypernyms.

  • Double-spaced text generally contains more hypernyms (eg. double spacing for quotations or reported conversations).

  • Keywords / Abstract / General terms generally constitute hypernyms (Very common in journal articles).
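
The signed entries of the general context vector (1 for a hypernym-indicating context, -1 for a hyponym-indicating one, 0 otherwise) can be combined as in the sketch below. The normalized weighted sum, the context names, and the weights are assumptions for illustration; the result lies in [-1, 1] and is a score rather than a calibrated probability.

```python
def general_score(signals, weights):
    """Normalized weighted sum of signed context signals.

    signals[c] is 1 if context c marks the word as a likely hypernym,
    -1 if it marks it as a likely hyponym, and 0 (or absent) otherwise.
    """
    for context, value in signals.items():
        if value not in (-1, 0, 1):
            raise ValueError(f"signal for {context!r} must be -1, 0 or 1")
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(w * signals.get(c, 0) for c, w in weights.items())
    return weighted / total_weight


# Hypothetical weights; minor contexts such as references get smaller values.
weights = {"caption": 1.0, "upper_case": 0.5, "question_mark": 0.5, "references": 0.25}
signals = {"caption": 1, "question_mark": -1}
print(general_score(signals, weights))
```

Giving the minor contexts smaller weights, as suggested in the list above, keeps them from overriding stronger cues such as captions.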

3.3 Personal Data Extraction

While the scope of this paper is restricted to hypernymy, this research has great potential value in general relationship discovery, for example personal data tagging and meronymy. Several data protection regulations demand the extraction of personal data entities from large document corpora. Metadata in unstructured text is very helpful in finding out whether some portions contain sensitive data (say biometrics, genetic information, or credit card numbers). In this regard, for completeness, we analyze document structure properties and state the following contexts to define measures in the same flavor as earlier.

  • Special characters such as “****, xxx” are usually associated with sensitive information.

  • Attachments with special names or numbers can include sensitive data.

  • As opposed to plain text, the probability of finding sensitive data in tables is much higher.

  • Section titles are very helpful in finding information about the context and in turn the possibility of personal data.

  • Footers of emails can contain personal data.

  • Responses to questionnaires and texts within blanks can point to sensitive data.

  • Boxed or colored (Say red) text can contain sensitive data.

  • Indented text and larger sized textual portions can contain sensitive data (say codes / PNR).

The corresponding measures can be constructed as earlier in the case of hypernymy.
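
As a small illustration of the first context in the list above, masking characters can be detected with a regular expression. The pattern below (runs of three or more asterisks, or a standalone run of three or more “x” characters) is an assumption chosen for the example; a production rule set would need to be tuned to the corpus.

```python
import re

# Three or more '*' in a row, or a standalone run of three or more x/X.
MASK_PATTERN = re.compile(r"\*{3,}|\b[xX]{3,}\b")


def has_masking(text):
    """Return True if the text contains a masking pattern such as **** or xxx."""
    return bool(MASK_PATTERN.search(text))


print(has_masking("Card: **** **** **** 1234"))
print(has_masking("SSN xxx-xx-1234"))
print(has_masking("The matrix is sparse"))
```

Such a detector would supply one binary entry of the context vector; the remaining contexts (tables, footers, section titles) would each need their own indicator.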

4 Implementation

The intent of our numerical results is threefold. First, we note that the base code from [7] predicts the hypernymy relationship between any two terms, whereas our work predicts hypernyms for any specific input term. Our first numerical contribution is from the standpoint of SemEval, where the task is to predict such hypernyms (more details below). Note that we included bigrams as opposed to unigrams and used spaCy to generate POS tags as required by the original implementation. We further mention that these are very general structure-theoretic measures that help in hypernymy detection. Second, we try to show the enhancement in hypernymy detection from using our proposed measures. Third, we introduce a new data set (ENRON) as a path for future study and run a portion of our exercises on it. All computation was done on a Linux cluster with 8 cores and 4 GB RAM. Our implementation is built on top of publicly available code from [7]. As mentioned by the authors of that base code, no one measure performs best on all datasets. Our document structure measures are built as wrappers (in Matlab and Python) around their base.

4.1 SemEval

In the SemEval tasks, we are provided with a vocabulary file containing the exhaustive list of all possible hypernyms, and we used it for prediction. We filtered the vocabulary file to remove a few corrupted words and very short words. We restricted our vocabulary to the words provided, but did not apply a minimum frequency filter as done by the original authors. In the original implementation of [7], two words are compared at a time. We vectorized this scoring task by comparing each word with all the possible hypernyms at once. We also used different data structures to improve efficiency and reduce latency. Once we had the scores for a given word, we ranked the candidates and output the top 100 words as hypernyms.
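
The one-against-all scoring and ranking step can be sketched as below. This is a plain-Python stand-in rather than the vectorized implementation described above: the function names, the dictionary-based vectors, and the choice of ClarkeDE as the scoring measure are assumptions for illustration, and a matrix-based version would follow the same shape.

```python
import heapq


def clarke_de(x_ctx, y_ctx):
    """ClarkeDE(x -> y) over dict-based context vectors (commonly cited form)."""
    total = sum(x_ctx.values())
    if total == 0:
        return 0.0
    shared = sum(min(w, y_ctx[c]) for c, w in x_ctx.items() if c in y_ctx)
    return shared / total


def top_hypernyms(query, vectors, candidates, k=100):
    """Score `query` against every candidate and return the k best candidates."""
    x_ctx = vectors.get(query, {})
    scored = (
        (clarke_de(x_ctx, vectors.get(cand, {})), cand)
        for cand in candidates
        if cand != query
    )
    # nlargest avoids sorting the whole vocabulary when k is small.
    return [cand for _, cand in heapq.nlargest(k, scored)]


# Toy vectors (hypothetical weights).
vectors = {
    "jazz": {"music": 2.0, "band": 1.0},
    "genre": {"music": 3.0, "band": 1.0, "style": 2.0},
    "guitar": {"string": 2.0, "band": 1.0},
}
print(top_hypernyms("jazz", vectors, ["genre", "guitar", "jazz"], k=2))
```

Using `heapq.nlargest` keeps the ranking cost close to linear in the vocabulary size when only the top 100 candidates are needed.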

Data Pre-processing

We worked on the following two sub-tasks in the SemEval 2018 Hypernym Discovery Task 9:

  • Music

  • Medical

The corpus provided by SemEval for each sub-task is unstructured: a text file of untagged sentences, one sentence per line. We used spaCy to parse each sentence of the corpus and tagged each word with its lemma and POS tag. We then used the tagged corpus for training. We kept only nouns, verbs, and adjectives, filtering out the remaining word types. We also restricted our corpus vocabulary to the words in the vocabulary file provided to us.
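
The filtering step can be sketched as follows. The (lemma, POS) pairs stand in for the output of a tagger such as spaCy, whose tokens expose `lemma_` and `pos_`; the function name `filter_tagged`, the tag set, and the sample data are assumptions for illustration.

```python
# Coarse POS tags to keep: nouns, verbs and adjectives.
KEEP_POS = {"NOUN", "VERB", "ADJ"}


def filter_tagged(tagged_sentence, vocabulary):
    """Keep lemmas whose POS tag is retained and which occur in the vocabulary."""
    return [
        lemma
        for lemma, pos in tagged_sentence
        if pos in KEEP_POS and lemma in vocabulary
    ]


# Hypothetical tagged sentence and vocabulary.
tagged = [
    ("the", "DET"),
    ("guitarist", "NOUN"),
    ("play", "VERB"),
    ("loud", "ADJ"),
    ("riff", "NOUN"),
]
vocab = {"guitarist", "play", "riff"}
print(filter_tagged(tagged, vocab))
```

Applying the vocabulary restriction at this stage keeps the context vectors small, since words outside the provided vocabulary can never be returned as hypernyms.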

Distributional Space

In [7], two parameters are described: context type and feature weighting. We experimented with different combinations of the two and found the window-based context with PLMI feature weighting to work better on the Music and Medical datasets. Our evaluation on both datasets is presented in Tables 1 and 2.

Measure          MAP    MRR    P@5
win5 ClarkeDE    0.205  0.089  0.093
win5 invCL       0.197  0.088  0.092
win5d ClarkeDE   0.217  0.094  0.097
win5d invCL      0.211  0.093  0.096
Table 1: Music Dataset

Measure          MAP    MRR    P@5
win5 ClarkeDE    0.204  0.091  0.091
win5 invCL       0.207  0.093  0.09
win5d ClarkeDE   0.22   0.097  0.099
win5d invCL      0.222  0.1    0.103
Table 2: Medical Dataset
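
For readers unfamiliar with the three column headers, the sketch below gives textbook per-query definitions of average precision, reciprocal rank, and precision at k; MAP and MRR are the means of the first two over all queries. These are generic definitions under stated assumptions, and the official SemEval scorer may differ in details such as cut-offs and normalization.

```python
def average_precision(ranked, gold):
    """Mean of precision values at the ranks where a gold hypernym is found."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold hypernym, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, gold, k=5):
    """Fraction of the top-k predictions that are gold hypernyms."""
    return sum(1 for item in ranked[:k] if item in gold) / k


# Hypothetical ranking for one query term.
ranked = ["genre", "instrument", "style", "music", "art"]
gold = {"genre", "music"}
print(average_precision(ranked, gold))
print(reciprocal_rank(ranked, gold))
print(precision_at_k(ranked, gold, k=5))
```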

4.2 Document Structure Measures

We note that document structure measures can also be regarded as dependency-based contexts and can be suitably deployed with dependency parse trees. We consider a set of four document structure features for our study: section titles, footnotes, subscripts and superscripts, and captions. The measures are tested on two datasets, namely Wackypedia (as in the earlier subsection) and ENRON (emails). We test the measures one at a time and all at once for hypernyms and report the results for the two datasets as follows.

Wikipedia Corpus:

Here, we used an unsupervised scoring measure to determine whether a word y is a hypernym of x. We tried the different measures mentioned in [7] and observed that invCL and ClarkeDE are the best performers for the Medical and Music datasets respectively. The distributional inclusion hypothesis states that the prominent contexts of a hyponym (x) are expected to be included in those of its hypernym (y). Using the best inclusion measures we found, given a word (hyponym) x, we score all vocabulary words as possible hypernyms and rank them. We take the top scorers and output them as hypernyms for the given word x. We propose a new measure for hypernym discovery based on the heuristic that expanding the context window with additional relevant words will improve the vector representation so as to better distinguish hypernym relations from other relations. In the rest of the section, we describe different document-structure-based contexts and provide the mathematical definition of the measure.

ENRON Emails:

We observe that hypernyms tend to be more general, while hyponyms are more specific in defining a real-world entity. We propose that section and document headings tend to generalize the description of real-world entities. PLMI, as described in 3.2, gives the conditional probability of a term being a hypernym given its co-occurrence with another term. We observe that this probability increases when one of the terms is a generalized term that typically appears in document and section headings. Drawing from the field of information retrieval, we observe that text that describes a term, called definition text, is more likely to contain hypernym terms than any other text in the document. There is related work on identifying whether a text is definition text. It may also be possible to assume that the first paragraph of a Wikipedia page is more likely to describe a term, and hence can be considered definition text. We plan to leverage work in text summarization to reduce the noise in our Document Structure measure.

5 Conclusion

We observe that the findings of [7] hold true on the domain-specific datasets that we experimented with. We found that no single measure performs best for hypernym discovery. We have introduced a new relatedness measure, based on document structure, to distinguish hypernym relations from other kinds of semantic relations between two terms. Besides incremental performance gains, we see some good new predictions of hypernyms that were absent when using standard contexts and measures from the literature. Our analysis on relatively newer datasets like ENRON is in line with real-world applications and points to directions for future research.


  • [1] V. Baisa and V. Suchomel, Corpus based extraction of hypernyms in terminological thesaurus for land surveying domain, in Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, 2015, Tribun EU, pp. 69–74.
  • [2] L. Espinosa-Anke, S. Oramas, J. Camacho-Collados, and H. Saggion, Finding and expanding hypernymic relations in the music domain, in 19th International Conference of the Catalan Association for Artificial Intelligence (CCIA), Barcelona, Spain, 19/10/2016.

  • [3] M. Kamel, C. Trojahn, A. Ghamnia, N. Aussenac-Gilles, and C. Fabre, Extracting hypernym relations from wikipedia disambiguation pages : comparing symbolic and machine learning approaches, in IWCS 2017 - 12th International Conference on Computational Semantics - Long papers, 2017.
  • [4] V. Shwartz, Y. Goldberg, and I. Dagan, Improving hypernymy detection with an integrated path-based and distributional method, CoRR, abs/1603.06076 (2016).
  • [5] R. Snow, D. Jurafsky, and A. Y. Ng, Learning syntactic patterns for automatic hypernym discovery, in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, eds., MIT Press, 2005, pp. 1297–1304.
  • [6] I. Yamada, K. Torisawa, J. Kazama, K. Kuroda, M. Murata, S. De Saeger, F. Bond, and A. Sumida, Hypernym discovery based on distributional similarity and hierarchical structures, in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2, EMNLP ’09, Stroudsburg, PA, USA, 2009, Association for Computational Linguistics, pp. 929–937.
  • [7] Shwartz, Vered and Santus, Enrico and Schlechtweg, Dominik. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection, arXiv preprint (2016)
  • [8] Agarwal, Arvind and Ganesan, Balaji and Gupta, Ankush and Jain, Nitisha and Karanam, Hima P and Kumar, Arun and Madaan, Nishtha and Munigala, Vitobha and Tamilselvam, Srikanth G. Cognitive Compliance for Financial Regulations, IT Professional, 19-4, 28-35, IEEE. (2017)
  • [9] Chang, Haw-Shiuan and Wang, ZiYun and Vilnis, Luke and McCallum, Andrew. Distributional Inclusion Vector Embedding for Unsupervised Hypernymy Detection. (2017)
  • [10] Chiticariu, Laura and Krishnamurthy, Rajasekar and Li, Yunyao and Raghavan, Sriram and Reiss, Frederick R and Vaithyanathan, Shivakumar. SystemT: an algebraic approach to declarative information extraction. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. 128-137. Association for Computational Linguistics. (2010)
  • [11] Wang, Chenguang and Burdick, Doug and Chiticariu, Laura and Krishnamurthy, Rajasekar and Li, Yunyao and Zhu, Huaiyu. Towards re-defining relation understanding in financial domain. Proceedings of the 3rd International Workshop on Data Science for Macro–Modeling with Financial and Economic Datasets. 8, ACM. (2017)

  • [12] Madaan, Nishtha and Karanam, Hima and Gupta, Ankush and Jain, Nitisha and Kumar, Arun and Tamilselvam, Srikanth. Visual Exploration of Unstructured Regulatory Documents. Proceedings of the 22nd International Conference on Intelligent User Interfaces Companion. 129-132, ACM. (2017)
  • [13] Yamada, Ichiro and Torisawa, Kentaro and Kazama, Jun’ichi and Kuroda, Kow and Murata, Masaki and De Saeger, Stijn and Bond, Francis and Sumida, Asuka. Hypernym discovery based on distributional similarity and hierarchical structures. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 929-937, Association for Computational Linguistics. (2009)
  • [14] R. E. Bank and R. K. Smith, General sparse elimination requires no permanent integer storage, SIAM J. Sci. Stat. Comput., 8 (1987), pp. 574–584.
  • [15] S. C. Eisenstat, M. C. Gursky, M. Schultz, and A. Sherman, Algorithms and data structures for sparse symmetric gaussian elimination, SIAM J. Sci. Stat. Comput., 2 (1982), pp. 225–237.
  • [16] A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice Hall, Englewood Cliffs, NJ, 1981.
  • [17] K. H. Law and S. J. Fenves, A node addition model for symbolic factorization, ACM TOMS, 12 (1986), pp. 37–50.
  • [18] J. W. H. Liu, A compact row storage scheme for cholesky factors using elimination trees, ACM TOMS, 12 (1986), pp. 127–148.
  • [19] , The role of elimination trees in sparse factorization, Tech. Report CS-87-12,Department of Computer Science, York University, Ontario, Canada, 1987.
  • [20] D. J. Rose, A graph theoretic study of the numeric solution of sparse positive definite systems, in Graph Theory and Computing, Academic Press, New York, 1972.
  • [21] D. J. Rose, R. E. Tarjan, and G. S. Lueker, Algorithmic aspects of vertex elimination on graphs, SIAM J. Comput., 5 (1976), pp. 226–283.
  • [22] D. J. Rose and G. F. Whitten, A recursive analysis of disection strategies, in Sparse Matrix Computations, Academic Press, New York, 1976.
  • [23] R. Schrieber, A new implementation of sparse gaussian elimination, ACM TOMS, 8 (1982), pp. 256–276.