AppTechMiner: Mining Applications and Techniques from Scientific Articles

09/10/2017 ∙ by Mayank Singh, et al. ∙ ERNET India

This paper presents AppTechMiner, a rule-based information extraction framework that automatically constructs a knowledge base of all application areas and problem solving techniques. Techniques include tools, methods, datasets or evaluation metrics. We also categorize individual research articles based on their application areas and the techniques proposed/improved in the article. Our system achieves high average precision (82%) in knowledge base creation. It also performs well in application and technique assignment to an individual article (average accuracy 66%). We further present two use cases: a simple information retrieval system and an extensive temporal analysis of the usage of techniques and application areas. At present, we demonstrate the framework for the domain of computational linguistics, but it can be easily generalized to any other field of research.




1. Introduction

It is not uncommon for researchers to envisage an information extraction system for scientific articles that can answer queries like (i) What are all the techniques and tools used in Machine Translation? or (ii) Which are the subareas of Computational Linguistics where Malt Parser is frequently used? However, the meta-information necessary for constructing such a system is rarely available. Each research domain consists of multiple application areas, which are typically associated with various techniques used to solve problems in these areas. For instance, two commonly used techniques in Information Extraction are “Conditional Random Fields” and “Hidden Markov Models”. Wikipedia lists 32 popular NLP tasks and sub-tasks. However, to our surprise, this list omits many trending application areas, for example, Dialog and Interactive Systems, Social Media, Cognitive Modeling and Psycholinguistics. In addition, new techniques are continuously being proposed or improved for an application area with time and changing needs. This temporal aspect raises diverse research questions: for example, how have the techniques for POS tagging varied over time, or what are the most important areas of Computational Linguistics that have been addressed in the last five years? The answers should also be of great interest to new researchers surveying an application area.

Contributions: In this paper, we introduce AppTechMiner, which automatically constructs a knowledge base of all application areas and problem solving techniques using a rule-based approach. Subsequently, the generated knowledge base can be employed in several information retrieval systems to answer the aforementioned questions. We demonstrate the current framework for the domain of computational linguistics because of the availability of full-text research articles. However, the proposed construction mechanism can be easily generalized to any other field of research. Next, we define two common keywords used in the current paper:

Area: An area represents an application area of a particular research domain. Common application areas (hereafter written in italics) in Computational Linguistics include Machine Translation, Dependency Parsing, POS Tagging, Information Extraction, etc.

Technique: A technique represents a tool or method used for a task. This may also include an evaluation tool or method. Common examples (hereafter written within quotes) include “Bleu Score”, “Rouge Score”, “Charniak Parser”, “TnT Tagger”, etc. Note that a technique of one paper can potentially be an area of another paper. For example, in “Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment” (Schoenemann, 2013), “Word Alignment” is an area, but in “Using Word-Dependent Transition Models in HMM-Based Word Alignment for Statistical Machine Translation” (He, 2007), “Word Alignment” is a technique for Machine Translation.

The entire framework is organized into four phases (Section 4):

  1. Creation of a ranked list of areas;

  2. Categorizing papers on the basis of areas;

  3. Creation of a ranked list of techniques;

  4. Categorizing papers on the basis of techniques.

Key results: We achieve high performance in each of the above phases (see Section 5). The precision of the first phase is 84% (for top 30 areas) and recall is 87%. For the second phase, the accuracy is 73.3%. The third phase results in a precision and recall of 80% (for top 26 techniques) and 80.7% respectively. In the fourth phase, our system achieves an accuracy of 60%.

Use cases: In Section 6, we present two use cases: (i) constructing an information retrieval system, and (ii) analysis of temporal characteristics of techniques associated with an area. We also investigate the temporal variation of the popular areas for specific conferences, namely, ACL and COLING.

2. Related Work

Extracting application areas and techniques is primarily an information extraction task. Information extraction (IE) from scientific articles combines approaches from natural language processing and data mining and has generated substantial research interest in recent times. In particular, there has been burgeoning research interest in the domain of biomedical documents. Shah et al. (Shah et al., 2003) extracted keywords from the full text of biomedical articles and claim that there exists a heterogeneity in the keywords from different sections. Müller et al. (Müller et al., 2008) have developed the Textpresso framework, which leverages ontologies for information retrieval and extraction. In a similar work, Fukuda et al. (Fukuda et al., 1998) proposed an IE system for protein name extraction. There has been significant work in information extraction in the area of protein structure analysis. Gaizauskas et al. (Gaizauskas et al., 2003) proposed PASTA, an IE system developed and evaluated for the protein structure domain. Friedman et al. (Friedman et al., 2001) have developed a similar system to extract structure information about cellular pathways using a knowledge model. Biological information extraction has seen extensive work covering diverse aspects, with a large number of survey papers. Cohen et al.’s (Cohen and Hersh, 2005) survey on biomedical text mining, Krallinger et al.’s (Krallinger et al., 2008) survey on information extraction and applications for biology, and Wimalasuriya et al.’s (Wimalasuriya and Dou, 2010) survey on ontology-based information extraction are examples of popular surveys on IE for the biomedical domain.

Information extraction in other domains has also received equally strong attention from researchers. Hyponym relations have been extracted automatically in the celebrated work by Hearst (Hearst, 1992). Caraballo (Caraballo, 1999) extended previous work on automatically building semantic lexicons to the automatic construction of a hierarchy of nouns and their hypernyms. Teufel (Teufel et al., 2000) proposed information management and information foraging for researchers and introduced a new document analysis technique called argumentative zoning, which is useful for generating user-tailored and task-tailored summaries. Kim et al. (Kim et al., 2010) and Lopez et al. (Lopez and Romary, 2010) are two popular works in automatic keyphrase extraction from scientific articles. Qazvinian et al. (Qazvinian and Radev, 2008) have explored summarization of scientific papers using citation summary networks, and citation summarization through keyphrase extraction (Qazvinian et al., 2010).

Jones (Jones, 2005) introduced an approach for entity extraction from labeled and unlabeled text. They proposed algorithms that alternately look at noun phrases and their local contexts to recognize members of a semantic class in context. A relatively recent work by Gupta et al. (Gupta and Manning, 2014) developed a pattern learning system with bootstrapped entity extraction. In Gupta et al. (Gupta and Manning, 2011), the authors investigated the dynamics of a research community by extracting key aspects from scientific papers and showed how extracting key information helps in analyzing the influence of one community on another. Jin et al. (Jin et al., 2013) proposed a supervised sequence labeling system that identifies scientific terms and their accompanying definitions.

We believe that this is the first attempt to specifically mine application areas and techniques from research articles. Instead of complex statistical machine learning models, we employ a rule-based approach, which is preferred in the commercial world for information extraction tasks (Chiticariu et al., 2013). The proposed construction mechanism can be easily generalized to any other field of research.

3. Dataset

We use the ACL Anthology Network (AAN) dataset (Radev et al., 2009), which consists of 21,213 full-text papers from the domain of computational linguistics and natural language processing. The dataset consists of papers published between 1965 and 2013 in 342 ACL venues.

4. Methodology

In this section, we describe methods to construct the knowledge base of areas and techniques. As we already pointed out in Section 1, the entire framework is organized in four phases: (1) creation of a ranked list of areas, (2) categorizing papers on the basis of areas, (3) creation of a ranked list of techniques, and (4) categorizing papers on the basis of techniques. Next, we describe these four phases in detail.

4.1. Creation of a ranked list of areas

We employ paper title information to extract areas. We use hand-written rules to extract phrases which are likely to contain the area names. We observe that some functional keywords, such as “for”, “via”, “using” and “with”, act as delimiters for such phrases. For example, the paper title “Moses: Open source toolkit for statistical machine translation” (Koehn et al., 2007) represents an instance of the form X for Y, where Y is the application area. We also observe that the phrase succeeding “for”, the phrase preceding “using”, or the phrase between both (e.g., in “Decision procedures for dependency parsing using graded constraints” (Menzel and Schröder, 1998)) is likely to contain the name of an area.
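The delimiter idea can be illustrated with a minimal sketch. It only covers the two rules spelled out above (the phrase after “for” and the phrase before “using”); the function name `candidate_area_phrases`, the keyword tuple, and the handling of a leading “Name:” prefix are assumptions made for illustration, not the authors' actual rule set.

```python
import re

# Assumed delimiter list: the four example keywords named in the text.
FUNCTIONAL_KEYWORDS = ("for", "via", "using", "with")

def candidate_area_phrases(title):
    """Return phrases that follow "for" or precede "using" in a paper title."""
    # Drop a leading "Name:" prefix such as "Moses:" and lowercase the rest.
    text = title.split(":", 1)[-1].strip().lower()
    pattern = r"\b(" + "|".join(FUNCTIONAL_KEYWORDS) + r")\b"
    # With a capturing group, re.split alternates: phrase, keyword, phrase, ...
    parts = re.split(pattern, text)
    phrases = [p.strip() for p in parts[0::2]]
    keywords = parts[1::2]
    candidates = []
    for i, keyword in enumerate(keywords):
        if keyword == "for" and phrases[i + 1]:
            candidates.append(phrases[i + 1])      # phrase succeeding "for"
        if keyword == "using" and phrases[i] and phrases[i] not in candidates:
            candidates.append(phrases[i])          # phrase preceding "using"
    return candidates

moses = candidate_area_phrases(
    "Moses: Open source toolkit for statistical machine translation")
menzel = candidate_area_phrases(
    "Decision procedures for dependency parsing using graded constraints")
```

The extracted candidates are still noisy (“statistical machine translation” rather than “machine translation”), which is why a ranking step over n-grams is needed afterwards.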

Seed set creation: We create a seed set of the above functional keywords and use bootstrapped pattern learning to gather more such words along with areas. We initially started with seven functional keywords and, through bootstrapped pattern learning, augmented this to a final set of 11 functional keywords.

Ranking of the extracted phrases: Even though bootstrapped pattern learning identified potential area names, we observe a large number of noisy phrases such as “machine translation system combination and evaluation”. Here, “machine translation” must be extracted from the surrounding noisy words. We notice that empirical ranking algorithms produce good results in extracting the exact area names from long phrases. We employ three ranking schemes, described below:

  • Scheme 1: In this scheme, we rank according to individual $n$-gram scores. The score of a given $n$-gram $g$ is calculated as:

    $$\mathrm{Score}(g) = \frac{C(g)}{\sum_{g'} C(g')}$$

    where $C(g)$ represents the occurrence count of the $n$-gram $g$ and the denominator represents the total count of all the $n$-grams.

  • Scheme 2: This scheme is very similar to the previous scheme, with an additional constraint: if the score of an $n$-gram is greater than the scores of both of its border $(n-1)$-grams, then the border grams are left out. The intuition behind this is as follows: the trigram “word sense disambiguation” will have a higher score than its border bigrams, “word sense” and “sense disambiguation”, causing both these bigrams to be left out.

  • Scheme 3: We improve upon the previous scheme by estimating different threshold scores for each $n$-gram length. The thresholds are selected manually by observing the individual $n$-gram lists.

In Section 5.2, we compare the precision of each of these schemes; we finally adopt Scheme 3 since it gives the best results. We present 24 of the top 30 areas judged as accurate by domain experts:

Machine Translation, Natural Language Processing, Word Sense Disambiguation, Speech Recognition, Question Answering, Dependency Parsing, Information Extraction, Chinese Word Segmentation, Semantic Role Labeling, Information Retrieval, Entity Recognition, Word Alignment, Conditional Random Fields, Maximum Entropy, Coreference Resolution, Machine Learning, Dialogue Systems, Textual Entailment, Natural Language Understanding, Active Learning, POS Tagging, Relation Extraction, Sentiment Analysis, Sense Induction
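Schemes 1 and 2 can be sketched in a few lines. This is an illustrative reading of the description above, assuming that scores are relative frequencies computed separately per n-gram length; the function names and the toy phrase list are hypothetical, and the manually tuned thresholds of Scheme 3 are not reproduced.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def scheme1_scores(phrases, n):
    """Scheme 1: count of an n-gram divided by the total count of all n-grams."""
    counts = Counter(g for p in phrases for g in ngrams(p.split(), n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def scheme2_filter(scores_n, scores_border):
    """Scheme 2: drop both border (n-1)-grams of any n-gram that outscores them."""
    dropped = set()
    for gram, score in scores_n.items():
        left, right = gram[:-1], gram[1:]
        if score > scores_border.get(left, 0.0) and score > scores_border.get(right, 0.0):
            dropped.update((left, right))
    return {g: s for g, s in scores_border.items() if g not in dropped}

# Toy candidate phrases (hypothetical, for illustration only).
phrases = [
    "word sense disambiguation",
    "unsupervised word sense disambiguation",
    "machine translation",
    "statistical machine translation",
]
trigram_scores = scheme1_scores(phrases, 3)
bigram_scores = scheme1_scores(phrases, 2)
kept_bigrams = scheme2_filter(trigram_scores, bigram_scores)
```

On this toy input the trigram “word sense disambiguation” scores 0.5 against 0.25 for each of its border bigrams, so both bigrams are removed, while “machine translation” survives because no trigram outscores it.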

4.2. Categorizing papers on the basis of areas

In this phase, we assign individual papers to one of the discovered areas. Individual papers are categorized to their corresponding areas on the basis of two strategies – direct match and relevance as per the language models, defined for various areas.

Direct match:

In the direct match approach, we search for an explicit string match between the title or abstract and one of the areas. In case we do not find a match in the title, we check for a direct match with the abstract of the paper. If the abstract contains only one such matching area then the paper is categorized to that area. On the other hand, if the title or the abstract contains more than one direct match with the set of area names then we further use the language modeling approach (discussed next) to classify that paper.
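A minimal sketch of the direct match step follows. It assumes case-insensitive substring matching, which the text does not specify, and the function name and return labels are hypothetical.

```python
def direct_match(title, abstract, areas):
    """Return ("assigned", area), ("ambiguous", [areas]) or ("unmatched", [])."""
    def matches(text):
        lowered = text.lower()
        return [area for area in areas if area.lower() in lowered]

    # The abstract is consulted only when the title yields no match.
    found = matches(title) or matches(abstract)
    if len(found) == 1:
        return "assigned", found[0]
    if len(found) > 1:
        return "ambiguous", found      # left to the language modeling step
    return "unmatched", []             # also left to the language modeling step

areas = ["Machine Translation", "Word Alignment", "Dependency Parsing"]
single = direct_match("Improved Statistical Machine Translation", "", areas)
fallback = direct_match("A New Decoder", "We study dependency parsing.", areas)
multiple = direct_match("Word Alignment for Machine Translation", "", areas)
```

Papers returned as "assigned" are also the ones used later to build the per-area language models.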

Language modeling: In this approach, we create a language model for each area, and classify a document into one of these areas. To create a language model for each area, we select the papers which could be classified on the basis of a single direct match. The titles and abstracts of all the papers belonging to one area are taken together to construct the language model of that area with the Jelinek-Mercer (JM) smoothing.

A document not categorized using direct match is treated as a query, consisting of the words in its title and abstract. After experimenting on a small set of sample papers, we fixed the smoothing parameter $\lambda$ for JM smoothing to 0.7 (Zhai and Lafferty, 2004). The prior probability $P(A)$ for an application area $A$ is proportional to the number of papers which were assigned to that area by a single direct match of either the title or the abstract. Hence, given a query paper $q$, the area which scores the highest

$$P(A \mid q) \propto P(A) \prod_{w \in q} P(w \mid A)$$

is assigned as the area for the given paper, where $P(w \mid A)$ is the JM-smoothed probability of word $w$ under the language model of area $A$.
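The classification step can be sketched as a query-likelihood classifier with a prior. The sketch assumes the Zhai and Lafferty form of JM smoothing, in which $\lambda$ weights the collection model, and it skips query words unseen in the whole collection; both choices, along with the function names and toy data, are assumptions rather than details confirmed by the paper.

```python
import math
from collections import Counter

LAMBDA = 0.7  # value reported in the text; its role as collection weight is assumed

def train(area_docs):
    """area_docs: area name -> list of token lists (title plus abstract)."""
    area_counts = {a: Counter(t for d in docs for t in d) for a, docs in area_docs.items()}
    collection = Counter()
    for counts in area_counts.values():
        collection.update(counts)
    n_docs = sum(len(docs) for docs in area_docs.values())
    # Prior proportional to the number of directly matched papers per area.
    priors = {a: len(docs) / n_docs for a, docs in area_docs.items()}
    return area_counts, collection, priors

def classify(query_tokens, area_counts, collection, priors, lam=LAMBDA):
    """Return the area maximizing log P(A) + sum over w of log P(w | A)."""
    coll_total = sum(collection.values())
    best_area, best_score = None, -math.inf
    for area, counts in area_counts.items():
        area_total = sum(counts.values())
        score = math.log(priors[area])
        for w in query_tokens:
            p_coll = collection[w] / coll_total
            if p_coll == 0.0:
                continue  # word unseen everywhere: skipped (an assumption)
            p_ml = counts[w] / area_total
            score += math.log((1 - lam) * p_ml + lam * p_coll)
        if score > best_score:
            best_area, best_score = area, score
    return best_area

model = train({
    "Machine Translation": [["statistical", "machine", "translation", "alignment"],
                            ["phrase", "translation", "model"]],
    "Dependency Parsing": [["dependency", "parsing", "treebank"]],
})
mt_paper = classify(["translation", "model", "decoding"], *model)
dp_paper = classify(["dependency", "treebank"], *model)
```

Working in log space avoids numerical underflow when titles and abstracts contain many words.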

4.3. Creation of a ranked list of techniques

This extraction phase is based on the idea of method papers. We classify a paper as method paper, if it introduces a novel technique or provides a toolkit in an area of computational linguistics. For instance, the paper introducing the Stanford CoreNLP toolkit is one such relevant example.

We observe two characteristics of these papers. First, they are expected to have been cited a number of times that is above some threshold, indicating that the technique introduced or improved upon is frequently used. Second, the fraction of times they obtain their citations in the “methodology” section of other papers is above some percentage threshold, indicating that they are primarily “method papers”. In the current framework, we select both thresholds based on extensive experiments on the AAN dataset. We assume that when a citing paper applies a technique from the cited paper, it cites that paper and also mentions the name of the technique in the citation context (i.e., the sentence where the citation is made). Our objective is to extract, from the citation context(s), all the techniques a method paper is used for. We now describe the algorithm in detail.
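The two criteria for a method paper translate into a simple filter. In the sketch below the threshold values passed in the example calls are placeholders chosen for illustration only; the input format (one section label per incoming citation) and the function name are likewise assumptions.

```python
def is_method_paper(citation_sections, min_citations, min_method_fraction):
    """citation_sections: one section label per incoming citation of the paper."""
    total = len(citation_sections)
    if total <= min_citations:
        return False  # not cited often enough
    in_method = sum(1 for section in citation_sections if section == "methodology")
    # Fraction of citations received inside methodology sections.
    return in_method / total > min_method_fraction

# Placeholder thresholds (hypothetical values, not those used by the authors).
toolkit = is_method_paper(["methodology"] * 8 + ["related work"] * 2, 5, 0.5)
survey = is_method_paper(["methodology"] * 2 + ["introduction"] * 8, 5, 0.5)
rarely_cited = is_method_paper(["methodology"] * 3, 5, 0.5)
```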

For every method paper in the corpus, we extract all the citation contexts where this paper has been cited. We observe that the techniques are usually represented as noun phrases in the citation contexts. For example, in the citation context “For English, we used the Dan Bikel implementation of the Collins parser (Collins, 2003).”, we obtain three noun phrases: 1) Dan Bikel implementation, 2) Collins parser, and 3) Collins. We build a global vector of noun phrases across all citation contexts for all the method papers. We consider this global vector as the ranked list of all the techniques used in the computational linguistics domain. The $i$-th component of the vector is the raw count of the $i$-th noun phrase, ordered lexicographically, over the method citations of the entire corpus. Some of the top ranking noun phrases are:

Penn Treebank, Stanford Parser, Rate Training, Berkeley Parser, Machine Translation, Statistical Machine Translation, Charniak Parser, Moses Toolkit, Word Sense Disambiguation, Maximum Entropy, IBM Model, Bleu Score, Perceptron Algorithm, Word Alignment, Stanford POS Tagger, Collins Parser, Natural Language Processing, Bleu Metric, Coreference Resolution, Moses Decoder, Giza++ Toolkit, Brill Tagger, TnT Tagger, Anaphora Resolution, MST Parser, CCG Parser, Malt Parser, Minimum Error Rate Training

4.4. Categorizing papers on the basis of techniques

To identify the techniques for which a paper $P$ is used, we extract all the noun phrases present in all the citation context(s) where this paper has been cited. We build a similar vector of these noun phrases, where each component of the vector is the raw count of that noun phrase drawn from the global vector introduced in the previous section. If a particular noun phrase from the global vector is missing in the citation contexts for $P$, its weight is set to zero. We score each noun phrase by its term in the dot product between this local vector of $P$ and the global vector, which gives a ranked list of possible techniques for $P$. Finally, we choose the top $k$ techniques on this ranked list as the techniques the paper is used for.
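The global vector and the per-paper ranking can be sketched as follows. The sketch assumes that noun phrases have already been extracted from each citation context (no chunker is included), reads the ranking as ordering phrases by their individual terms in the dot product, and breaks ties alphabetically; paper identifiers and function names are hypothetical.

```python
from collections import Counter

def global_vector(contexts_by_paper):
    """contexts_by_paper: method paper id -> list of citation contexts,
    each context given as a list of already-extracted noun phrases."""
    counts = Counter()
    for contexts in contexts_by_paper.values():
        for noun_phrases in contexts:
            counts.update(noun_phrases)
    return counts

def top_techniques(paper_id, contexts_by_paper, global_counts, k):
    """Rank the noun phrases seen in one paper's citation contexts."""
    seen = {phrase for context in contexts_by_paper[paper_id] for phrase in context}
    # Local weight equals the global count for seen phrases (zero otherwise),
    # so each dot-product term is the squared global count.
    scores = {phrase: global_counts[phrase] * global_counts[phrase] for phrase in seen}
    ranked = sorted(scores, key=lambda phrase: (-scores[phrase], phrase))
    return ranked[:k]

contexts = {
    "collins2003": [["Dan Bikel implementation", "Collins parser", "Collins"],
                    ["Collins parser"]],
    "koehn2007": [["Moses toolkit", "Collins parser"]],
}
counts = global_vector(contexts)
best = top_techniques("collins2003", contexts, counts, 1)
```

The toy data also shows a limitation of pure count-based ranking: a globally frequent phrase such as “Collins parser” would outrank “Moses toolkit” for the hypothetical paper `koehn2007` as well.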

The four phases result in a knowledge base that consists of a list of areas, a mapping from individual papers to the list of areas, a list of techniques, and a mapping from individual papers to the list of techniques. We can employ this generated knowledge base in multiple information retrieval tasks. Section 6.1 demonstrates the construction of one such IR system.

5. Evaluation Results

In this section, we present extensive evaluations carried out on our proposed system. Section 5.1 discusses general evaluation guidelines along with summary of human judgment experiment settings.

5.1. Evaluation setup

As described in Section 4, the entire framework is organized into four phases. Therefore, we evaluate each phase individually using human judgment experiments. For the first and third phases, two subject experts (the first and the second author) are employed. For the second and fourth phases, we circulate an online survey among six subject experts (four PhD and two undergraduate students). Each subject expert evaluated 20 paper-area and ten paper-technique assignments. In total, we evaluate 120 paper-area and 60 paper-technique assignments.

5.2. Evaluation of the ranked list of areas

First, we conduct an experiment to understand the relative performance of the three schemes described in Section 4.1 for creation of the ranked list. Scheme 3 (80%) outperforms Scheme 1 (57%) and Scheme 2 (73%) in terms of precision. Therefore, we employ Scheme 3 for the creation of the ranked list in the subsequent stages.

We evaluate the ranked list of the potential areas in the computational linguistics domain extracted from the ACL corpus. We employ precision-recall measures for the purpose of evaluation. For computing recall, however, due to limited human resources for the challenging task of labeling areas for the entire corpus of papers, we select a random set of 200 research papers and manually identify each of their areas (the second author participated in the labeling task). In total, we find 23 distinct areas (comparable to the Wikipedia list of 32 popular tasks). Scheme 3 identified 20 out of 23 areas, achieving a high recall of 87%.

Precision was computed by measuring the fraction of correctly identified areas in the top-$K$ area list. Table 1 presents the values of precision obtained for the top $K$ = 25, 50, 75 and 100 application areas. As we can observe, the majority of correct areas are ranked higher by our ranking methodology.

| K | Precision (%), Areas | Precision (%), Techniques |
|---|---|---|
| 25 | 84 | 80 |
| 50 | 72 | 64 |
| 75 | 51 | 48 |
| 100 | 43 | 41 |

Table 1. The precision values for K = 25, 50, 75 and 100 for extraction of the list of application areas (Scheme 3) and techniques.
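The two measures used in this evaluation are simple ratios. The sketch below shows them on inputs consistent with the reported figures: a precision of 84% at K = 25 corresponds to 21 correct entries, and 20 of 23 annotated areas gives the reported recall of 87%. The function names and the placeholder area labels are illustrative.

```python
def precision_at_k(judgments, k):
    """judgments: booleans for the ranked list (best rank first), True if correct."""
    top = judgments[:k]
    return sum(top) / len(top)

def recall(found, gold):
    """Fraction of gold-standard items that appear among the found items."""
    gold = set(gold)
    return len(set(found) & gold) / len(gold)

# 21 correct entries among the top 25 reproduce the 84% in Table 1.
p25 = precision_at_k([True] * 21 + [False] * 4, 25)

# 20 of 23 annotated areas found; area names here are placeholders.
gold_areas = [f"area_{i}" for i in range(23)]
r = recall(gold_areas[:20], gold_areas)
```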

We also employed another domain expert to annotate the first 30 results independently of the first judge. Inter-annotator agreement (Cohen’s kappa coefficient) was calculated, and the value of $\kappa$ came out to be 0.79. The matrix with the agreement/disagreement counts between the experts is presented in Table 2.

Table 2. The matrix of agreement and disagreement between two domain experts for annotation of the area list (rows: Domain Expert 1, Yes/No/Total; columns: Domain Expert 2, Yes/No/Total).
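For reference, Cohen's kappa for two annotators and a Yes/No judgment can be computed from the four cells of such an agreement matrix. The counts in the example call are hypothetical and are not the values from Table 2, so the result is not expected to match the reported 0.79.

```python
def cohens_kappa(yes_yes, yes_no, no_yes, no_no):
    """Cohen's kappa from a 2x2 agreement matrix.

    yes_no means expert 1 said Yes and expert 2 said No, and so on.
    """
    n = yes_yes + yes_no + no_yes + no_no
    observed = (yes_yes + no_no) / n
    # Chance agreement: product of the two experts' marginal rates, per label.
    p_yes = ((yes_yes + yes_no) / n) * ((yes_yes + no_yes) / n)
    p_no = ((no_yes + no_no) / n) * ((yes_no + no_no) / n)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)

# Hypothetical counts for illustration only.
kappa = cohens_kappa(yes_yes=20, yes_no=5, no_yes=10, no_no=15)
```

With these counts the observed agreement is 0.7 and the chance agreement is 0.5, giving a kappa of 0.4.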

5.3. Evaluating the extraction of areas from individual papers

Next, we evaluate our area assignment phase. As described in Section 5.1, out of the 120 expert assignments, 88 (73.3%) assignments were marked as correct.

5.4. Evaluating the list of techniques

This evaluation task is similar to the evaluation of the ranked list of application areas (see Section 5.2). However, in this case, recall calculation is difficult if we work with the top $k$ techniques for each method paper. To simplify the task, we calculate recall for only the highest ranked technique for each method paper. Again, due to resource constraints, we select a small random set of 30 papers and aggregate all their citation contexts from the method sections of the citing papers. Annotation of this random set resulted in 26 distinct introduced or improved techniques. The technique extraction algorithm obtained 21 out of 26 techniques, resulting in a recall of 80.7%. Table 1 shows the precision obtained for the technique extraction algorithm for various values of $K$. As we can observe, the majority of the correct techniques are ranked higher by our ranking algorithm.

Here again, we asked another domain expert to annotate the results independently of the first judge. We also calculated the inter-annotator agreement (Cohen’s kappa coefficient) for the top 25 techniques, and $\kappa$ came out to be 0.65. The matrix of agreement/disagreement counts is presented in Table 3.

Table 3. The matrix of agreement and disagreement between two domain experts for annotation of the technique list (rows: Domain Expert 1, Yes/No/Total; columns: Domain Expert 2, Yes/No/Total).

5.5. Evaluating the extraction of techniques from a method paper

For this evaluation, we employ subject experts as described in Section 5.1. We achieve a moderate accuracy of 60% on a random set of 60 paper-technique assignments.

6. Use Case

In this section, we present two use cases. In the first use case, we demonstrate construction of an example information retrieval system. In the second use case, we analyze the evolution of the application areas and the corresponding techniques over a given time-period.

6.1. An example information retrieval system

We demonstrate the construction of an information retrieval (IR) system that takes an area name as input and outputs a list of tools and techniques. An example of input/output of such an IR system could be: Machine Translation → “Word Alignment”, “Gale Church Algorithm”, “Bleu Score”, “Moses Toolkit”, etc. We propose a count-update-based algorithm to construct this IR system. More specifically, for each paper $P$, we find its area and all the techniques of the method papers that it cites in its methodology section, and append all these techniques to the list corresponding to the extracted area for this paper.

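Result: list of techniques for that area
initialization: T[A] ← ∅ for every area A;
for each paper P do
       for each method paper M ∈ MethodPapersCitedBy(P) do
              T[Area(P)] ← T[Area(P)] ∪ Technique(M);
       end for
end for
Algorithm 1 Algorithm to generate list of techniques given area name.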

Here, function Area(P) returns area of a paper P. Function Technique(M) returns techniques introduced or improved upon by a method paper M. Function MethodPapersCitedBy(P) returns all the method papers cited by paper P in its methodology section. A simple variation of Algorithm 1 by keeping track of the number of times a particular technique features in an area can potentially trace most popular techniques for an area.
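Algorithm 1, together with the counting variation mentioned above, can be sketched in Python. The three lookup functions are passed in as callables because their implementations come from the earlier phases; the dictionaries used in the example and all paper identifiers are hypothetical.

```python
from collections import Counter, defaultdict

def techniques_by_area(papers, area_of, method_papers_cited_by, techniques_of):
    """Map each area to a Counter of techniques used by papers in that area."""
    area_techniques = defaultdict(Counter)
    for paper in papers:
        area = area_of(paper)
        if area is None:
            continue  # paper could not be assigned to any area
        for method_paper in method_papers_cited_by(paper):
            # Counting occurrences lets us rank techniques by popularity.
            area_techniques[area].update(techniques_of(method_paper))
    return area_techniques

# Hypothetical toy knowledge base.
areas = {"p1": "Machine Translation", "p2": "Machine Translation", "p3": None}
cited = {"p1": ["m1", "m2"], "p2": ["m1"], "p3": ["m2"]}
techs = {"m1": ["Bleu Score"], "m2": ["Moses Toolkit"]}

index = techniques_by_area(
    papers=["p1", "p2", "p3"],
    area_of=areas.get,
    method_papers_cited_by=lambda p: cited.get(p, []),
    techniques_of=lambda m: techs.get(m, []),
)
top = index["Machine Translation"].most_common(1)
```

Answering a query for an area then reduces to a dictionary lookup followed by `most_common`.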

In Table 4, we present some of the input/output examples from higher ranked areas of Computational Linguistics. As we see from these examples, the techniques extracted consist of sub-tasks, tools and datasets popularly used in an area. Also, it is interesting to observe that the extracted techniques span a wide range of time, for example, techniques like “Collins Parser”, “Berkeley Parser”, “Charniak Parser”, “Stanford Parser”, “MST Parser” and “Malt Parser” are introduced in Dependency Parsing at substantially different time periods.

| Area | Techniques |
|---|---|
| Machine Translation | Bleu Score, Rate Training, IBM Model, Word Alignment, Moses Toolkit, Inversion Transduction Grammar, Bootstrap Resampling, Translation Model, Penn Treebank, Translation Quality, Language Model, Gale Church Algorithm |
| Dependency Parsing | Penn Treebank, Malt Parser, Berkeley Parser, MST Parser, Charniak Parser, Collins Parser, Maximum Entropy, Nivre’s Arc-Eager, Stanford Parser, Perceptron Algorithm |
| Multi-document Summarization | Topic Signatures, Information Extraction, Page Rank, Klsum Summarization System, Mead Summarizer, Word Sense Disambiguation, Lexical Chains, Inverse Sentence Frequency |
| Word Sense Disambiguation | Coarse Senses, Semcor Corpus, Senseval Competitions, Semantic Similarity, Micro Context, Maximum Entropy, Mutual Information |
| Sense Induction | Word Sense Disambiguation, SemEval Word Sense Induction, Chinese Whispers, Recursive Spectral Clustering, Topic Models, Graded Sense Annotation, Ontonotes Project |
| Opinion Mining | Sentiment Analysis, Mutual Information, Spin Model, Subjectivity Lexicon, Semantic Role Analysis, Multiclass Classifier, Coreference Resolution, Latent Dirichlet |
| Chinese Word Segmentation | Entity Recognition, Conditional Random Fields, Segmentation Bakeoff, Stanford Chinese Word Segmenter, Perceptron Algorithm, Discourse Segmentation, CRF model |

Table 4. Example application areas and corresponding techniques from the AAN dataset.

6.2. Temporal Analysis

We analyze evolution of application areas and techniques over a given time-period. Below, we present three temporal scenarios.

6.2.1. Evolution of areas

From the list of popular areas (based on the total number of papers published in an area) in AAN, we present six representative areas, namely, Machine Translation, Dependency Parsing, Speech Recognition, Information Extraction, Summarization and Semantic Role Labeling, and study their popularity (the percentage of papers in that area for a time period out of the total papers published in that time period) from 1980 to 2013 in 5-year windows. Figure 1 demonstrates the temporal variations for these areas and how they evolve with time.

Observations: While areas like Machine Translation and Dependency Parsing are on the rise, Information extraction and Semantic Role Labeling are on a decline. A further interesting observation is that till 1994, the ACL community had a lot of interest in Speech Recognition which then saw a sharp decline possibly because of the fact that the speech community slowly separated out.
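The popularity measure used here (percentage of a window's papers that belong to an area) can be sketched as below. The input format, a list of (year, area) pairs, and the function name are assumptions; the toy data is hypothetical and does not reflect the AAN statistics.

```python
from collections import Counter

def popularity_by_window(papers, start, end, width=5):
    """papers: iterable of (year, area). Returns {window_start: {area: percent}}."""
    totals = Counter()
    per_area = {}
    for year, area in papers:
        if not start <= year <= end:
            continue
        # Align each year to the first year of its fixed-width window.
        window = start + ((year - start) // width) * width
        totals[window] += 1
        per_area.setdefault(window, Counter())[area] += 1
    return {
        window: {area: 100.0 * count / totals[window]
                 for area, count in per_area[window].items()}
        for window in totals
    }

# Hypothetical toy data.
toy = [(1980, "Machine Translation"), (1981, "Speech Recognition"),
       (1983, "Speech Recognition"), (1984, "Speech Recognition"),
       (1985, "Machine Translation"), (2013, "Machine Translation")]
trend = popularity_by_window(toy, 1980, 2013)
```

Note that with a 1980 start and 5-year windows, the final window (2010 onward) is shorter than the others, so its percentages are based on fewer years of data.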

Figure 1. Evolution of different application areas over time in terms of fraction of publications. Machine Translation and dependency parsing are on the rise, information extraction and semantic role labeling are on a decline. ACL community gradually separates out from Speech community.

[Figure 2 panels: AAN (1975-1984, 1985-1994, 1995-2004, 2005-2013); ACL (1975-1984, 1985-1994, 1995-2004, 2005-2013); COLING (1965-1974, 1975-1984, 1985-1994, 1995-2004)]
Figure 2. Phrase clouds representing the proportion of papers for an area across various time periods for the complete AAN dataset as well as the ACL and COLING conferences. ACL seems to be more interested in areas such as Machine Translation and Dependency Parsing over the recent decades. The COLING community also seems more interested in areas like Machine Translation and Dependency Parsing, along with Bilingual Lexicon Extraction, in the recent decades.

6.2.2. Evolution of major areas in top conferences

We shortlist two top-tier conferences in the computational linguistics domain, namely, the Annual Meeting of the Association for Computational Linguistics (ACL) and the International Conference on Computational Linguistics (COLING). We study 40 years of conference history by dividing it into four 10-year buckets. Next, for each conference, we extract the ten most popular areas (based on citation counts) for each 10-year bucket. Figure 2 presents phrase clouds representing the evolution of areas in these two conferences in comparison to the full AAN dataset itself. Some of the interesting observations from this analysis are:

  • Full AAN dataset: Here, we observe that while in the earlier decades, areas such as Semantic Role Labeling, Evaluation of Natural Language and Speech Recognition were dominant, they fade away in the recent decades. On the other hand, areas such as Machine Translation and Dependency Parsing, which were less prevalent in the earlier decades gain significant importance in the recent decades. We also see Sentiment Analysis as one of the major areas in the last decade.

  • ACL: In the earlier decades, this community was interested in areas like Linguistic Knowledge Sources and Semantic Role Labeling. Over the recent decades, however, it seems to be more interested in areas such as Machine Translation and Dependency Parsing. Interestingly, in the time period 2005 – 2013, an upcoming area of Social Media is found to gain importance.

  • COLING: Areas like Lexical Semantics and Linguistic Knowledge Sources were of interest to the community in the earlier decades. However, in recent years, areas like Machine Translation, Dependency Parsing and Bilingual Lexicon Extraction have gained importance. An interesting observation here is that Semantic Role Labeling has been a thrust area for this particular conference throughout.

| Area | 1990-1994 | 1995-1999 | 2000-2004 | 2005-2009 | 2010-2013 |
|---|---|---|---|---|---|
| Dependency Parsing | Dependency Unification Grammar, Kasper Algorithm, Left Corner Parser, Inheritance Systems, Eurotra Project | Penn Treebank, Probabilistic Context Free Grammar, Tree Substitution Grammar, Conditional Random Fields, Dependency Links, Collins Parser | Penn Treebank, Collins Parser, Berkeley Parser, Charniak Parser, Maximum Entropy, NEGRA Corpus | Penn Treebank, Charniak Parser, Malt Parser, MST Parser, Berkeley Parser, Stanford Parser, CCG Parser, Nivre’s arc-eager | Penn Treebank, Malt Parser, MST Parser, Berkeley Parser, Charniak Parser, Stanford Parser, Perceptron Algorithm, Nivre’s arc-eager |
| Machine Translation | Parse-parse-match Approaches, Early Type Deduction, Bottom-up Head Driven Algorithm, Bilingual signs | IBM Model, Inversion Transduction Grammar, Word Alignment, Sentence Alignment, Moses Toolkit | Word Alignment, Bleu Score, Inversion Transduction Grammar, Parse-parse-match Approaches | Rate Training, IBM Model, Bleu Score, Word Alignment, Inversion Transduction Grammar, Moses Toolkit | Bleu Score, Rate Training, Moses Toolkit, Word Alignment, Bootstrap Resampling, IBM Model |
| Sentiment Analysis | Early Type Deduction Mechanisms, Unification Grammars, Sentence Plan Language, Mutual Information, Taxonomy Files | Levenshtein Distance, Discourse Structure | Mutual Information, Information Extraction, Penn Treebank, Distributional Similarity, Statistical Parser | Mutual Information, Word Sense Disambiguation, Subjectivity Lexicon, Latent Dirichlet, Spin Model | Mutual Information, Word Sense Disambiguation, Subjectivity Lexicon, Latent Dirichlet, Polarity Lexicons |
| Cross-lingual Textual Entailment | Ordinary Dictionary, Text Generation, Dependency Unification Grammar, Machine Translation | Discourse Structure, Encode TFS, Temporal Information, English Texts, Kappa Coefficient, CUE Phrases | Mutual Information, Manual Annotation, Distributional Similarity, Heuristic Approaches | Word Sense Disambiguation, Machine Translation, Textual Entailment Challenge | Semantic Textual Similarity, Verb Ocean, Moses Toolkit, Machine Translation |
| Grammatical Error Correction | Probabilistic Context Free Grammars, Parseval Metric, Brill POS Tagger | Penn Treebank, Prepositional Phrase Attachment, Collins Parser | Penn Treebank, Brill Tagger, FNTBL Toolkit, Charniak Parser, Kappa Statistics | Penn Treebank, Word Sense Disambiguation, Charniak Parser, OOV words | English corpus, CLC FCE dataset, OOV words, Berkeley Parser, Charniak Parser |

Table 5. A few examples of areas and their top techniques for different time periods. “Penn Treebank” is extensively used for Dependency Parsing and Grammatical Error Correction across almost all time periods. “Moses Toolkit” and “IBM Model” are both popular techniques across most time periods in Machine Translation. “Mutual Information” found important use in Sentiment Analysis.

6.2.3. Evolution of techniques in areas

In the second use case, we study the evolution of techniques for a given area. For this analysis, we divide the timeline into fixed buckets of years. Next, for each bucket, we use our proposed system to extract the popular techniques, based on the number of times papers have cited each technique. Table 5 presents the popular techniques for five example areas. Some of the interesting trends from Table 5 are listed below:

  • Dependency Parsing: New techniques like “Malt Parser” and “Minimum Spanning Tree (MST) Parser” came into existence in 2005 – 2009. In the next bucket (2010 – 2013), these parsers overtake earlier parsers such as “Collins Parser” and “Berkeley Parser” in popularity and are almost on par with “Charniak Parser”. In addition, we observe that the “Penn Treebank” is extensively used for Dependency Parsing across almost all time periods.

  • Machine Translation: We found that “Word Alignment” and “Inversion Transduction Grammar” are popular techniques for Machine Translation across all time periods. Also, “Bleu Score” has been a popular technique since its introduction in 2000 – 2004. Similarly, “Moses Toolkit” and “IBM Model” are both popular techniques across most time periods.

  • Sentiment Analysis: In this area, “Mutual Information” and “Word Sense Disambiguation” are popular techniques for most of the time periods. “Latent Dirichlet Allocation” (introduced in 2003) found important use in Sentiment Analysis in 2005 – 2009. The “Spin Model” also gained popularity in 2005 – 2009.

  • Cross-lingual Textual Entailment: “Distributional Similarity” and “Mutual Information” are important techniques and are popular in multiple time periods. “Verb Ocean” becomes popular in 2005 – 2009 and 2010 – 2013. It is also interesting to note that “Machine Translation” is itself an important tool for this area and is very popular in 2005 – 2009. In 2010 – 2013, however, its popularity goes down. A probable explanation is the introduction of techniques that perform Cross-lingual Textual Entailment without “Machine Translation” (Mehdad et al., 2012).

  • Grammatical Error Correction: Techniques to address out-of-vocabulary (OOV) words have become important in recent times. Over the years, “Collins Parser” was replaced by “Charniak Parser” and finally by “Berkeley Parser”. “Penn Treebank” is an important dataset for this area.
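
The bucketing procedure described at the start of this subsection can be sketched as follows. This is a minimal illustration, not the actual AppTechMiner code: the paper records, the field names (`year`, `area`, `techniques`) and the function names are hypothetical, and we assume that each paper counts at most once per technique. Only the year buckets are taken from Table 5.

```python
from collections import Counter, defaultdict

# Year buckets as used in Table 5; everything else below is illustrative.
BUCKETS = [(1990, 1994), (1995, 1999), (2000, 2004), (2005, 2009), (2010, 2013)]


def bucket_for(year):
    """Return the (start, end) bucket containing `year`, or None if no bucket fits."""
    for start, end in BUCKETS:
        if start <= year <= end:
            return (start, end)
    return None


def top_techniques(papers, area, k=5):
    """For one area, rank techniques in each bucket by the number of citing papers."""
    counts = defaultdict(Counter)
    for paper in papers:
        bucket = bucket_for(paper["year"])
        if paper["area"] != area or bucket is None:
            continue
        # Assumption: a paper contributes one count per technique it cites.
        counts[bucket].update(set(paper["techniques"]))
    return {b: [t for t, _ in c.most_common(k)] for b, c in counts.items()}


# Hypothetical toy records; real input would come from the extraction system.
papers = [
    {"year": 2006, "area": "Dependency Parsing", "techniques": ["Penn Treebank", "Malt Parser"]},
    {"year": 2008, "area": "Dependency Parsing", "techniques": ["Penn Treebank", "MST Parser"]},
    {"year": 2011, "area": "Dependency Parsing", "techniques": ["Malt Parser"]},
    {"year": 2007, "area": "Machine Translation", "techniques": ["Bleu Score"]},
]

top = top_techniques(papers, "Dependency Parsing", k=1)
```

On this toy input, `top` maps the 2005 – 2009 bucket to “Penn Treebank” and the 2010 – 2013 bucket to “Malt Parser”. Comparing the ranked lists of consecutive buckets gives the kind of trend reported in the bullets above.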

7. Conclusion and Future Work

In this paper, we have proposed a rule-based information extraction system to extract application areas and techniques from scientific articles. The system extracts a ranked list of all application areas in the computational linguistics domain. At a more granular level, it also extracts the application area for a given paper. We evaluate our system with domain experts and show that it performs reasonably well on both precision and recall. As a use case, we present an extensive analysis of temporal variation in the popularity of the techniques for a given area. Some of the interesting observations that we make here are that areas like Machine Translation and Dependency Parsing are rising in popularity, while areas like Speech Recognition, Linguistic Knowledge Sources and Evolution of Natural Language are on the decline.

In future, we plan to work on constructing a multi-level mapping table that maps application areas to techniques and, further, techniques to a set of parameters. For example, Machine Translation (application area) has “Bleu Score” as one of its techniques. Bleu Score is an algorithm that takes a few input parameters. Changing these parameters will change the outcome of the score. One such parameter is n, which represents the value of n for the n-grams.
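
To illustrate how such a parameter affects the outcome, the sketch below computes only the clipped n-gram precision that Bleu Score builds on, for different values of n. It is a simplified illustration under stated assumptions, not the full metric: it uses a single reference and omits the brevity penalty and the averaging over n-gram orders. The function names and the example sentences are our own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of `candidate` against a single `reference`."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Illustrative sentences, not taken from any dataset.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

for n in (1, 2, 3):
    print(n, modified_precision(candidate, reference, n))
```

For this sentence pair the precision is 5/6 for n = 1, 3/5 for n = 2 and 1/4 for n = 3, so the same translation receives a different value depending on the chosen n. This is the kind of parameter dependence that the planned mapping table would record.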

All our methods can be generalized to domains other than computational linguistics. We plan to build an online version of AppTechMiner in the near future. We also plan to study the temporal characteristics of techniques for a given application area, to see whether we can predict if the popularity of a technique will increase or decrease in the years to come.


References

  • Caraballo (1999) Sharon A Caraballo. 1999. Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. Association for Computational Linguistics, 120–126.
  • Chiticariu et al. (2013) Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems! In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. 827–832.
  • Cohen and Hersh (2005) Aaron M Cohen and William R Hersh. 2005. A survey of current work in biomedical text mining. Briefings in bioinformatics 6, 1 (2005), 57–71.
  • Friedman et al. (2001) Carol Friedman, Pauline Kra, Hong Yu, Michael Krauthammer, and Andrey Rzhetsky. 2001. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17, suppl 1 (2001), S74–S82.
  • Fukuda et al. (1998) Ken-ichiro Fukuda, Tatsuhiko Tsunoda, Ayuchi Tamura, Toshihisa Takagi, and others. 1998. Toward information extraction: identifying protein names from biological papers. In Pac symp biocomput, Vol. 707. Citeseer, 707–718.
  • Gaizauskas et al. (2003) Robert Gaizauskas, George Demetriou, Peter J. Artymiuk, and Peter Willett. 2003. Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 19, 1 (2003), 135–143.
  • Gupta and Manning (2011) Sonal Gupta and Christopher D Manning. 2011. Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers. In IJCNLP. 1–9.
  • Gupta and Manning (2014) Sonal Gupta and Christopher D Manning. 2014. Spied: Stanford pattern-based information extraction and diagnostics. Sponsor: Idibon 38 (2014).
  • He (2007) Xiaodong He. 2007. Using word dependent transition models in HMM based word alignment for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, 80–87.
  • Hearst (1992) Marti A. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 (COLING ’92). Association for Computational Linguistics, Stroudsburg, PA, USA, 539–545.
  • Jin et al. (2013) Yiping Jin, Min-Yen Kan, Jun-Ping Ng, and Xiangnan He. 2013. Mining Scientific Terms and their Definitions: A Study of the ACL Anthology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, 780–790.
  • Jones (2005) Rosie Jones. 2005. Learning to extract entities from labeled and unlabeled text. Ph.D. Dissertation. Citeseer.
  • Kim et al. (2010) Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. 2010. Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, 21–26.
  • Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, and others. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 177–180.
  • Krallinger et al. (2008) Martin Krallinger, Alfonso Valencia, and Lynette Hirschman. 2008. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome biology 9, Suppl 2 (2008), 1–14.
  • Lopez and Romary (2010) Patrice Lopez and Laurent Romary. 2010. HUMB: Automatic key term extraction from scientific articles in GROBID. In Proceedings of the 5th international workshop on semantic evaluation. Association for Computational Linguistics, 248–251.
  • Mehdad et al. (2012) Yashar Mehdad, Matteo Negri, and José Guilherme C de Souza. 2012. FBK: cross-lingual textual entailment without translation. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation. Association for Computational Linguistics, 701–705.
  • Menzel and Schröder (1998) Wolfgang Menzel and Ingo Schröder. 1998. Decision procedures for dependency parsing using graded constraints. In Proceedings of ACL’90. Citeseer.
  • Müller et al. (2008) Hans-Michael Müller, Arun Rangarajan, Tracy K Teal, and Paul W Sternberg. 2008. Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers. Neuroinformatics 6, 3 (2008), 195–204.
  • Qazvinian and Radev (2008) Vahed Qazvinian and Dragomir R Radev. 2008. Scientific paper summarization using citation summary networks. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, 689–696.
  • Qazvinian et al. (2010) Vahed Qazvinian, Dragomir R Radev, and Arzucan Özgür. 2010. Citation summarization through keyphrase extraction. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 895–903.
  • Radev et al. (2009) Dragomir R. Radev, Pradeep Muthukrishnan, and Vahed Qazvinian. 2009. The ACL Anthology Network Corpus. In Proceedings, ACL Workshop on Natural Language Processing and Information Retrieval for Digital Libraries. Singapore.
  • Schoenemann (2013) Thomas Schoenemann. 2013. Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment. In ACL (1). 22–31.
  • Shah et al. (2003) Parantu K. Shah, Carolina Perez-Iratxeta, Peer Bork, and Miguel A. Andrade. 2003. Information extraction from full text scientific articles: Where are the keywords? BMC Bioinformatics 4, 1 (2003), 1–9.
  • Teufel et al. (2000) Simone Teufel and others. 2000. Argumentative zoning: Information extraction from scientific text. Ph.D. Dissertation. Citeseer.
  • Wimalasuriya and Dou (2010) Daya C Wimalasuriya and Dejing Dou. 2010. Ontology-based information extraction: An introduction and a survey of current approaches. Journal of Information Science (2010).
  • Zhai and Lafferty (2004) Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22, 2 (2004), 179–214.