For an ever-increasing spectrum of applications (e.g., medical text analysis, opinion mining, sentiment analysis, social media text analysis, customer intelligence, fraud analytics, etc.), mining and analysis of unstructured natural language text data is necessary [1, 2, 3].
One of the key challenges in designing such text analytics (TA) applications is identifying the right set of features. For example, for the text classification problem, different sets of features have been considered in different works (spanning a history of more than twenty years), including ‘bag of words’, ‘bag of phrases’, ‘bag of n-grams’, ‘WordNet-based word generalizations’, and ‘word embeddings’ [4, 5, 6, 7, 8].
Even for recent end-to-end designs using deep neural networks, specification of core features remains manually driven [9, 10]. During feature engineering, data scientists often manually determine which features to use based upon their experience and expertise with respect to the underlying application domain as well as state-of-the-art tools and techniques. The various tools available to an NLP data scientist for TA application design and development (e.g., NLTK, Mallet, Stanford CoreNLP, Apache OpenNLP, Apache Lucene, etc.) often differ in their support for feature extraction and in the level of granularity at which the feature extraction process must be specified; these tools also often use different programming vocabularies to specify semantically equivalent features.
Currently, there is no generic method or approach that can be applied during a TA application’s design process to define and extract features for an arbitrary application in an automated or semi-automated manner. There is not even a single way to express the wide range of NLP features. This increases effort during feature engineering, which has to start afresh for each data scientist, and makes automated reuse of features difficult across semantically similar or related applications designed by different data scientists. It also hinders foundational studies on NLP feature engineering, including why certain features are more critical than others.
In this paper, we present an approach towards automating NLP feature engineering. We start with an outline of a language for expressing NLP features that abstracts over the feature extraction process and implicitly captures the intent of the NLP data scientist to extract specific features from given input text. We then discuss a method to enable automated reuse of features across semantically related applications when a corpus of feature specifications for related applications is available. The proposed language and system would help reduce manual effort in the design and extraction of features, would ensure standardization in the feature specification process, and could enable effective reuse of features across similar and/or related applications.
2 Life Cycle View
Figure 1 depicts the typical design life cycle of a (traditional) ML-based solution for TA applications, which involves steps to manually define relevant features and to implement code components that extract those features from the input text corpus during training, validation, testing, and actual usage of the application. In traditional ML-based solutions, feature interactions also need to be explicitly specified, though this step is largely automated when using deep neural network based solutions.
As the process of defining features is manual, the prior experience and expertise of the designer affect which features are extracted and how they are extracted from input text. Current practice lacks standardization and automation in the feature definition process, provides only partial automation in the extraction process, and does not enable automated reuse of features across related applications.
Next, let us consider scenarios where features are specified as elements of a language. Let us refer to this language as the NLP Feature Specification Language (nlpFSpL), such that a program in nlpFSpL specifies which features should be used by the underlying ML-based solution to achieve the goals of the target application. Given a corpus of unstructured natural language text data and a specification in nlpFSpL, an interpreter, implemented as a feature extraction system (FExSys), can automatically generate a feature matrix which can be directly used by the underlying ML technique.
3 NLP Feature Specification Language
3.1 Meta Elements
Figure 3 specifies the meta elements of nlpFSpL, which are used by the FExSys while interpreting other features.
Analysis Unit (AU) specifies the level at which features are to be extracted. At the Corpus level, features are extracted for all text documents together. At the Document level, features are extracted for each document in the corpus separately. At the Para (paragraph) level, features are extracted for the multiple sentences constituting a paragraph together. At the Sentence level, features are extracted for each sentence. Figure 4 depicts the classes of features considered in nlpFSpL and their association with different AUs.
Syntactic Unit (SU) specifies the unit of linguistic features. It could be a ‘Word’, a ‘Phrase’, an ‘N-gram’, a sequence of words matching a specific lexico-syntactic pattern captured as a ‘POS tag pattern’ (e.g., a Hearst pattern), a sequence of words matching a specific regular expression ‘Regex’, or a combination of these. The Regex option is used for special types of terms, e.g., dates, numbers, etc. LOGICAL is a Boolean logical operator including AND, OR, and NOT (in conjunction with another operator). For example, Phrase AND POS Regex specifies inclusion of a ‘Phrase’ as an SU only when its constituents also satisfy the ‘regex’ over ‘POS tags’. Similarly, POS Regex OR NOT(Regex) specifies inclusion of a sequence of words as an SU if it satisfies the ‘POS tag pattern’ but does not match the pattern specified by the character ‘Regex’. Note that an SU can be a feature in itself for document- and corpus-level analysis.
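As a rough illustration (not part of the FExSys implementation, whose interface the paper does not fix), such SU combinations can be sketched in Python: a hypothetical matcher selects token spans whose POS-tag sequence matches a ‘POS tag pattern’ and whose surface text satisfies (or, under NOT, violates) a character ‘Regex’:

```python
import re

def match_su(tokens, tags, pos_pattern, char_pattern=None, negate_char=False):
    """Select token spans whose space-joined POS-tag sequence matches
    `pos_pattern` and whose surface text matches `char_pattern`
    (or, with negate_char=True, does NOT match it)."""
    tag_string = " ".join(tags)
    spans = []
    for m in re.finditer(pos_pattern, tag_string):
        # Map character offsets in the tag string back to token indices.
        start = tag_string[:m.start()].count(" ")
        width = m.group().count(" ") + 1  # number of tags covered by the match
        surface = " ".join(tokens[start:start + width])
        ok = True if char_pattern is None else bool(re.fullmatch(char_pattern, surface))
        if negate_char:
            ok = not ok
        if ok:
            spans.append(surface)
    return spans
```

For example, `match_su(tokens, tags, r"JJ NN")` extracts adjective–noun pairs, while passing `char_pattern` with `negate_char=True` mimics a POS Regex AND NOT(Regex) combination.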
Normalize Morphosyntactic Variants: If YES, variants of words, including stems, lemmas, and fuzzy matches, will be identified before analyzing the input text for feature extraction and will be treated as equivalent.
3.2 Feature Types
3.2.1 Linguistic Features
Figure 5 depicts a two-level taxonomy of the features considered linguistic.
To illustrate, let us consider context-based features: Table 6 gives the various options which need to be specified to direct how the context of an SU should be extracted. For example, Context_Window := [2, Sentence] will extract all tokens within the current sentence that are within a distance of 2 on either side of the current SU. However, Context_Window := [2, Sentence]; POSContext := NN|VB will extract only those tokens within the current sentence that are within a distance of 2 on either side of the current SU and have POS tag NN (noun, singular) or VB (verb, base form).
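A minimal sketch of how such a specification might be interpreted (the function name and interface are illustrative, not the actual FExSys API):

```python
def context_features(tokens, tags, index, window, pos_filter=None):
    """Return tokens within `window` positions on either side of the SU
    at `index`, restricted to the current sentence (the `tokens` list),
    optionally keeping only tokens whose POS tag is in `pos_filter`."""
    left = max(0, index - window)
    right = min(len(tokens), index + window + 1)
    context = []
    for i in range(left, right):
        if i == index:
            continue  # skip the SU itself
        if pos_filter is None or tags[i] in pos_filter:
            context.append(tokens[i])
    return context
```

Here, `context_features(tokens, tags, i, 2)` corresponds to Context_Window := [2, Sentence], and passing `pos_filter={"NN", "VB"}` additionally applies POSContext.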
Table 7 illustrates how to specify the way the head directionality of the current SU should be extracted.
3.2.2 Semantic Similarity and Relatedness based Features
Semantic similarity can be estimated between words, phrases, sentences, and documents in a corpus. Estimation can be based upon corpus text alone, by applying approaches like vector space modeling, latent semantic analysis, topic modeling, or neural embeddings (e.g., Word2Vec or GloVe) and their extensions to the phrase, sentence, and document levels. Alternatively, it can be estimated based upon ontological relationships (e.g., WordNet-based) among concept terms appearing in the corpus.
3.2.3 Statistical Features
Figure 8 depicts the different types of statistical features which can be extracted for individual documents or for a corpus of documents together, along with methods to extract these features at different levels.
In particular, examples of distributions which can be estimated include frequency distributions of terms, term distributions in topics, topic distributions within documents, and the distribution of term inter-arrival delay, where the inter-arrival delay of a term is the number of tokens occurring between two successive occurrences of that term.
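The inter-arrival delay statistic, for instance, can be computed in a single pass over the token stream (a sketch; the exact estimator used by FExSys is not fixed by the paper):

```python
from collections import defaultdict

def interarrival_delays(tokens):
    """For each term, collect the number of tokens occurring between
    successive occurrences of that term (one list of delays per term)."""
    last_seen = {}
    delays = defaultdict(list)
    for pos, term in enumerate(tokens):
        if term in last_seen:
            delays[term].append(pos - last_seen[term] - 1)
        last_seen[term] = pos
    return dict(delays)
```

Frequency distributions over these delay lists can then be estimated per term, per document, or over the whole corpus, matching the AU meta element.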
4 Illustration of nlpFSpL Specification
Let us consider the problem of identifying the medical procedures referenced in a medical report.
Sample Input (Discharge summary): “This XYZ non-interventional study report from a medical professional. Spontaneous report from a physician on 01-Oct-1900. Patient condition is worsening day by day. Unknown if it was started before or after levetiracetam initiation. The patient had a history of narcolepsy as well as cataplexy. Patient condition has not recovered. On an unknown date patient was diagnosed to have epilepsy. The patient received the first dose of levetiracetam for seizure.”
Table 1 shows the specification of features in nlpFSpL. The table below lists, for each token (SU) at the beginning of the sample input, its context tokens.
|SU||Context tokens (within a window of 3 tokens on each side)|
|This||XYZ, non, -|
|XYZ||This, non, -, interventional|
|non||This, XYZ, -, interventional, study|
|-||This, XYZ, non, interventional, study, report|
|interventional||XYZ, non, -, study, report, from|
|study||non, -, interventional, report, from, a|
|report||interventional, study, from, a, medical|
|from||interventional, study, report, a, medical, professional|
|a||study, report, from, medical, professional, .|
|medical||report, from, a, professional, ., Spontaneous|
|professional||from, a, medical, ., Spontaneous, physician|
|.||a, medical, professional, Spontaneous, physician, on|
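The window extraction behind such rows can be reproduced with a few lines of Python (a sketch assuming simple whitespace tokenization with a window of 3 tokens on each side; the table's handling of punctuation may differ slightly):

```python
def window_contexts(tokens, window=3):
    """For every token, collect up to `window` context tokens on each
    side, returning (token, context) pairs in order."""
    rows = []
    for i, tok in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        rows.append((tok, left + right))
    return rows
```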
5 NLP Feature Reuse across TA Applications
Next, let us consider the case of enabling automated reuse of feature specifications in nlpFSpL across different semantically related applications.
5.1 Illustrative Example
To illustrate that semantically different yet related applications may have significant potential for feature reuse, let us consider the problem of event extraction, which involves identifying occurrences of specific types of events or activities in raw text.
Towards that, we analysed published works on three different types of events in different domains, as described next:
- Bio-molecular Interactions
The objective of this study is to design an ML model for identifying whether there are mentions of one of nine types of bio-molecular interactions in (publicly available) biomedical data. To train an SVM-based classifier, the authors use the GENETAG database, a tagged corpus for gene/protein named entity recognition. The BioNLP 2009 shared task test set was used to estimate the performance of the system. Further details can be found in the original work.
- Financial Events in News
The objective of the study was to design an ML model for automated detection of specific financial events in news text. Ten different types of financial events were considered, including announcements regarding CEOs, presidents, products, competitors, partners, subsidiaries, share values, revenues, profits, and losses. To train and test SVM- and CRF-based ML models, the authors used a data set consisting of 200 news messages extracted from the Yahoo! Business and Technology newsfeeds, with financial events and relations manually annotated by 3 domain experts. Further details can be found in the original work.
- Events from Twitter Data
The objective of the study was to design an ML-based system for extracting an open-domain calendar of significant events from Twitter data. 38 different types of events were considered in designing the system. To train the ML model, an annotated corpus of 1,000 tweets (containing 19,484 tokens) was used, and the trained model was tested on the 100 million most recent tweets. Further details can be found in the original work.
The table below depicts the classes of features selected by the authors of these works (as described in the corresponding references above) to highlight the point that, despite domain differences, these applications share similar sets of features. Since the authors of these works did not cite each other, it is possible that these features were identified independently. This, in turn, supports the hypothesis that if adequate details of any one or two of these applications are fed to a system such as the one described in this work, which is designed to estimate semantic similarities across applications, the system can automatically suggest potential features for the remaining applications to start with, without requiring manual knowledge of the semantically related applications.
5.2 Reuse Process
Figure 9 depicts overall process flow for enabling automated feature recommendations.
For a new text analytics application requiring feature engineering, the process starts by estimating its semantic proximity (from the perspective of an NLP data scientist) to existing applications with known features. Based upon these proximity estimates, as well as the expected relevance of features for the existing applications, the system recommends features for the new application in ranked order. Furthermore, if the user's selections are not aligned with the system's recommendations, the system gradually adapts its recommendations so that it eventually achieves alignment with user preferences.
Towards that, let us start by characterizing text analytics applications. A TA application's details should include the following fields:
- Problem Description
A text-based description of a TA application (or problem). For example, “identify medical procedures being referenced in a discharge summary” or “what are the input and output entities mentioned in a software requirements specification”.
- Annotation Level
The analysis unit at which features are to be specified, training annotations are available, and the ML model is designed to give outcomes. Options include word, phrase, sentence, paragraph, or document.
- Problem Type
Specifies the technical classification of the underlying ML challenge with respect to a well-defined ontology, e.g., classification (with details), clustering, etc.
- Performance Metric
Specifies how the performance of the ML model is to be measured; again, this should be specified as per some well-defined ontology.
The knowledge base of text analytics applications contains details for TA applications in the above format. Each application is further assumed to be associated with a set of features (or feature types) specified in nlpFSpL, together with their relevance scores against a performance metric. The relevance score of a feature measures the extent to which the feature contributes to the overall performance of the ML model solving the underlying application. Relevance scores may be estimated using any of the known feature selection metrics.
To specify the knowledge base formally, let us assume that there are $m$ different applications and $n$ unique feature specifications across these applications applying the same performance metric. Let us denote these as $A = \{A_1, \ldots, A_m\}$ and $F = \{F_1, \ldots, F_n\}$ respectively. The knowledge base is then represented as a feature co-occurrence matrix $\mathcal{R} \in \mathbb{R}^{m \times n}$ such that $\mathcal{R}[i][j]$ is the relevance score of feature specification $F_j$ ($1 \le j \le n$) for application $A_i$.
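As an illustration of this representation (with made-up applications, feature specifications, and relevance scores; zeros mark features not used by an application):

```python
import numpy as np

# Hypothetical knowledge base: m = 3 applications x n = 4 feature
# specifications; R[i, j] holds the relevance score of feature j
# for application i under a common performance metric.
applications = ["bio_events", "financial_events", "twitter_events"]
features = ["bag_of_words", "pos_context", "dependency_path", "tfidf"]
R = np.array([
    [0.8, 0.6, 0.9, 0.0],
    [0.7, 0.5, 0.0, 0.4],
    [0.9, 0.0, 0.3, 0.6],
])
```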
5.3 Measuring Proximity between Applications
To begin, for each text-based field of each TA application, pre-process the text and perform term normalization (i.e., replacing all equivalent terms with one representative term in the whole corpus), including stemming, short-form/long-form expansion (e.g., ‘IP’ and ‘Intellectual Property’), and language-thesaurus-based synonyms (e.g., WordNet-based ‘goal’ and ‘objective’).
5.3.1 Identify key Terms
Thereafter, we identify potential ‘entity terms’ as noun phrases and ‘action terms’ as verb phrases by applying POS tagging and chunking. E.g., in the sentence “This XYZ non-interventional study report is prepared by a medical professional”, the identifiable entity terms are “this XYZ non-interventional study report” and “medical professional”, and the identifiable action term is ‘prepare’.
5.3.2 Generate Distributed Representations
Analyze the corpus of all unique words generated from the text-based details across all applications in the knowledge base. Generally, the corpus of such textual details would be relatively small; therefore, one can potentially apply pre-trained word embeddings (e.g., word2vec or GloVe). Let $vec(w)$ be the neural embedding of word $w$ in the corpus. We need the following additional steps to generate term-level embeddings (alternate solutions also exist): Represent the corpus in Salton’s vector space model and estimate an information-theoretic weight for each word using the BM25 scheme. Let $\omega_w$ be the weight for word $w$; update the word embedding as $vec(w) \leftarrow \omega_w \times vec(w)$. For each multi-word term $t = (w_1, \ldots, w_k)$, generate the term embedding by averaging the embeddings of its constituent words: $vec(t) = \frac{1}{k}\sum_{i=1}^{k} vec(w_i)$.
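The weighting and averaging steps can be sketched as follows (assuming `embeddings` maps words to vectors and `weights` holds precomputed BM25 weights; both names are illustrative):

```python
import numpy as np

def term_embedding(term_words, embeddings, weights):
    """Weighted term embedding: scale each word vector by its corpus
    weight (e.g., BM25), then average over the words in the term."""
    vecs = [weights[w] * embeddings[w] for w in term_words]
    return np.mean(vecs, axis=0)
```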
In terms of these term embeddings, for each text-based field of each application in the knowledge base, generate a field-level embedding as a triplet as follows: Let $x$ be a field of an application in the knowledge base. Let the lists of entity terms and action terms in $x$ be $ent(x)$ and $act(x)$ respectively, and let the remaining words in $x$ be $res(x)$. Estimate the embedding for $x$ as $vec(x) = [vec_{ent}(x), vec_{act}(x), vec_{res}(x)]$, where $vec_{ent}(x)$, $vec_{act}(x)$, and $vec_{res}(x)$ are the averages of the embeddings of the elements of $ent(x)$, $act(x)$, and $res(x)$ respectively.
5.3.3 Estimating Proximity between Applications
After representing the different fields of an application in the embedding space (except AU), estimate field-level similarity between two applications as follows: Let $[vec_{ent}(x_1), vec_{act}(x_1), vec_{res}(x_1)]$ and $[vec_{ent}(x_2), vec_{act}(x_2), vec_{res}(x_2)]$ be the representations of field $x$ for two applications $A_1$ and $A_2$. In terms of these, the field-level similarity is estimated as $sim_x(A_1, A_2) = mean(sim_{ent}, sim_{act}, sim_{res})$, where each component similarity is 0 if the field-level details of either application are unavailable, and otherwise $sim_{ent} = \cos(vec_{ent}(x_1), vec_{ent}(x_2))$, etc.
For the AU field, estimate $sim_{AU} = 1$ if both applications have the same analysis unit, else $sim_{AU} = 0$.
In terms of these, let $[sim_{PD}, sim_{AU}, sim_{PT}, sim_{PM}]$ be the overall similarity vector across fields, where $sim_{PD}$ refers to the field ‘problem description’, etc. Finally, estimate the mean similarity across the constituent fields as the proximity between the corresponding applications.
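The similarity computation above can be sketched as follows (a minimal version, assuming each field is represented by its (entity, action, residual) triplet and `None` marks an unavailable component):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; 0.0 when either side is unavailable."""
    if u is None or v is None:
        return 0.0
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def field_similarity(triplet_a, triplet_b):
    """Mean cosine similarity over the (entity, action, residual)
    components of two field-level embeddings."""
    return float(np.mean([cosine(a, b) for a, b in zip(triplet_a, triplet_b)]))

def proximity(field_sims):
    """Overall proximity: mean similarity across constituent fields."""
    return float(np.mean(field_sims))
```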
5.4 Feature Recommendations
Let $A_{new}$ be a new application for which features need to be specified in nlpFSpL. Represent the fields of $A_{new}$ similarly to the existing applications in the knowledge base (as described in Section 5.3.2).
Next, create a degree-1 ego-similarity network for $A_{new}$ to represent how close $A_{new}$ is to the existing applications in the knowledge base. Let this be represented as a diagonal matrix $\Delta_{m \times m}$ such that $\Delta[i][i]$ is the proximity between $A_{new}$ and the $i^{th}$ application in the knowledge base (obtained by applying the steps in Section 5.3.3).
Thereafter, let $W = \Delta \times \mathcal{R}$, such that $W[i][j]$ measures the probable relevance of feature $F_j$ for $A_{new}$ w.r.t. the performance metric, based upon its relevance for $A_i$. When there are multiple applications in the knowledge base, we need to define a policy to determine the collective probable relevance of a feature specification for $A_{new}$ based upon its probable relevance scores with respect to the different applications.
To achieve that, let $rel(F_j)$ denote the collective relevance of $F_j$ for $A_{new}$ under a chosen policy, which can be estimated in different ways. Consider the following three example policies:
- Aggressive Policy
The weakest relevance across applications is considered: $rel(F_j) = \min_{1 \le i \le m} W[i][j]$.
- Conservative Policy
The strongest relevance across applications is considered: $rel(F_j) = \max_{1 \le i \le m} W[i][j]$.
- Probable Policy
The most likely relevance across applications is considered, e.g., the mean $rel(F_j) = \frac{1}{m} \sum_{i=1}^{m} W[i][j]$.
Rank the feature specifications in decreasing order of $rel(F_j)$ and suggest them to the NLP data scientist together with the supporting evidence.
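The weighting, policy collapse, and ranking steps can be sketched together (a hedged illustration: the paper does not fix the ‘probable’ policy's estimator, so the mean is used here as a stand-in for the most likely value):

```python
import numpy as np

def rank_features(R, proximities, policy="probable"):
    """Weight each application's relevance scores by its proximity to the
    new application, collapse across applications by policy, and return
    feature indices ranked by decreasing collective relevance."""
    W = np.diag(proximities) @ R     # W[i, j]: probable relevance of F_j via A_i
    if policy == "aggressive":       # weakest relevance across applications
        rel = W.min(axis=0)
    elif policy == "conservative":   # strongest relevance across applications
        rel = W.max(axis=0)
    else:                            # "probable": mean as a stand-in
        rel = W.mean(axis=0)
    order = list(np.argsort(-rel))
    return order, rel
```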
5.5 Continuous Learning from User Interactions
There are two different modes in which the user may provide feedback on the recommended features: first, the user may rank the features differently; second, the user may provide different relevance scores (e.g., based upon an alternate design or by applying feature selection techniques). The aim is to use this feedback to learn an updated similarity scoring function $sim'(\cdot)$.
In relation to these recommendations, let $rank_\tau(F_j)$ return the rank of feature $F_j$ and $rel_\tau(F_j)$ return its relevance score, based upon the type $\tau \in \{\text{system}, \text{user}\}$.
Next, for each feature $F_j$, let $H_j$ be a hash table with application ids as keys and, as values, lists of the numbers estimated next. Also, let $\Delta'$ contain the updated similarity scores between $A_{new}$ and the existing applications in the knowledge base.
For each feature specification $F_j$, determine whether the ‘user’-given rank differs from the ‘system’-given rank, i.e., whether $rank_{user}(F_j) \neq rank_{system}(F_j)$. If so, execute the following steps.
Let $Z_j$ be the list of applications which contributed to estimating the collective relevance of feature $F_j$. For example, when the aggressive or conservative policy is used, $Z_j$ contains only the application at which the minimum (respectively, maximum) of $W[i][j]$ is attained.
For each $A_i \in Z_j$: add $\delta_{ij}$ to $H_j[A_i]$, where $\delta_{ij}$ is estimated in one of two ways: from the explicit relevance scores when the user provides them for $F_j$, or otherwise from the change in rank when the user only re-ranks the features.
For each application $A_i$: derive an updated similarity score $\Delta'[i]$ from the values accumulated in $H_j[A_i]$. If $|\Delta'[i] - \Delta[i][i]| > \epsilon \cdot \Delta[i][i]$, i.e., when the difference between the old and new similarity scores is more than a fraction $\epsilon$ of the original similarity, add the pair of application representations together with $\Delta'[i]$ to a training set, which is used to train a regression model for $sim'(\cdot)$ by applying partial recursive PLS, with the application representations as the predictor (independent) variables and the updated similarity score as the response variable. The existing proximity scores between applications in the knowledge base (ref. Section 5.3.3) are also added to the training set before generating the regression model.
Note that $\epsilon$ is a small fraction greater than 0 which controls when the similarity model should be retrained. For example, $\epsilon = 0.05$ would imply that the underlying model uses a piece of feedback for retraining only if the change in similarity is more than 5%.
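The retraining trigger amounts to a one-line relative-change test (function name illustrative):

```python
def needs_retraining(old_sim, new_sim, eps=0.05):
    """Retrain the similarity model only when the change in a proximity
    score exceeds the fraction `eps` of its old value."""
    return abs(new_sim - old_sim) > eps * abs(old_sim)
```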
In this paper, we have presented a high-level overview of a feature specification language for ML-based TA applications and an approach to enable reuse of feature specifications across semantically related applications. Currently, there is no generic method or approach that can be applied during a TA application’s design process to define and extract features for an arbitrary application in an automated or semi-automated manner, primarily because there is no standard way to specify the wide range of features which can be extracted and used. We considered different classes of features, including linguistic, semantic, and statistical, at various levels of analysis, including words, phrases, sentences, paragraphs, documents, and the corpus. We then presented an approach for building a recommendation system that enables automated reuse of features in new application scenarios and improves its underlying similarity model based upon user feedback.
To take this work forward, it is essential to integrate it into an ML platform used by a large user base for building TA applications, so as to populate a repository with a statistically significant number of TA applications (with details as specified in Section 5) and thereafter refine the proposed approach so that it eventually enables reuse of features across related applications.
-  Michael W Berry and Jacob Kogan. Text Mining: Applications and Theory. John Wiley & Sons, 2010.
-  Charu C Aggarwal and ChengXiang Zhai. Mining text data. Springer Science & Business Media, 2012.
-  Steven Struhl. Practical Text Analytics: Interpreting Text and Unstructured Data for Business Intelligence. Kogan Page Publishers, 2015.
-  Sam Scott and Stan Matwin. Feature engineering for text classification. In ICML, volume 99, pages 379–388, 1999.
-  Alessandro Moschitti and Roberto Basili. Complex linguistic features for text classification: A comprehensive study. In European Conference on Information Retrieval, pages 181–196. Springer, 2004.
-  Xiao-Bing Xue and Zhi-Hua Zhou. Distributional features for text categorization. IEEE Transactions on Knowledge and Data Engineering, 21(3):428–442, 2009.
-  Yijun Xiao and Kyunghyun Cho. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367, 2016.
-  Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
-  Yoav Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57:345–420, 2016.
-  Yoav Goldberg. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017.
-  Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. "O’Reilly Media, Inc.", 2009.
-  A. K. McCallum. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu., 2017.
-  Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
-  OpenNLP. Apache OpenNLP Toolkit. http://opennlp.apache.org/, 2017.
-  Lucene Core. Apache Lucene Core. https://lucene.apache.org/core/, 2017.
-  Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, COLING ’92, pages 539–545, Stroudsburg, PA, USA, 1992. Association for Computational Linguistics.
-  Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
-  Thomas K Landauer, Peter W Foltz, and Darrell Laham. An Introduction to Latent Semantic Analysis. Discourse Processes, 25(2-3):259–284, 1998.
-  David M Blei and John D Lafferty. Topic Models. Text mining: classification, clustering, and applications, 10:71–87, 2009.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. arXiv, October 2013.
-  Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics.
-  Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
-  Jari Björne, Filip Ginter, Sampo Pyysalo, Jun’ichi Tsujii, and Tapio Salakoski. Complex event extraction at pubmed scale. Bioinformatics, 26(12):i382–i390, June 2010.
-  Frederik Hogenboom. Automated Detection of Financial Events in News Text. PhD thesis, Erasmus School of Economics, Burgemeester Oudlaan 50, 3062 PA Rotterdam, Netherlands, 2014.
-  Alan Ritter, Mausam, Oren Etzioni, and Sam Clark. Open domain event extraction from twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 1104–1112, New York, NY, USA, 2012. ACM.
-  Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
-  Mo Yu and Mark Dredze. Learning composition models for phrase embeddings. Transactions of the Association for Computational Linguistics, 3:227–242, 2015.
-  Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, April 2009.
-  S Joe Qin. Recursive pls algorithms for adaptive data modeling. Computers & Chemical Engineering, 22(4-5):503–514, 1998.