Natural language processing (NLP) and text analysis have become increasingly popular in engineering analytics [Chang2015, Zhang2014, Grawe2017, Liu2020]. To ensure the accuracy and efficiency of NLP tasks such as indexing, topic modelling and information retrieval [Blanchard2007, Gerlach2019, Tsz-WaiLo2005, Fox1989, Wilbur1992], uninformative words, often referred to as “stopwords”, need to be removed in the pre-processing step to increase the signal-to-noise ratio of the unstructured text data. Example stopwords include “each”, “about”, “such” and “the”. Stopwords appear frequently across many natural language documents, or across many parts of a single document, but carry little information about the part of the text they belong to.
The use of a standard stopword list, such as the one distributed with the popular Natural Language Toolkit (NLTK) [bird2009natural] Python package, for removal in data pre-processing has become standard NLP practice in both research and industry. There have been efforts to identify stopwords from generic knowledge sources such as the Brown Corpus [Fox1989, Kucera1969], the 20 newsgroup corpus, a books corpus [Montemurro2010], etc., and to curate a generic stopword list for removal in NLP applications across fields. However, the technical language used in engineering or technical texts differs from layperson language and may use stopwords that are less prevalent in layperson language. When it comes to engineering or technical text analysis, researchers and engineers either simply adopt the readily available generic stopword lists for removal [Chang2015, Zhang2014, Grawe2017, Liu2020], leaving much noise in the data, or identify additional stopwords in a manual, ad hoc or heuristic manner [Blanchard2007, Sarica2020, Seki2005, Crow2004]. There exists no standard stopword list for technical language processing applications.
Here, we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative data-driven approaches. The resultant stopword list is statistically identified and human-evaluated. Researchers, analysts and engineers working on technology-related textual data and technical language analysis can directly apply it for denoising and filtering of their technical textual data, without conducting the manual and ad hoc discovery and removal of uninformative words by themselves.
2 Our approach
To identify stopwords in technical language texts, we statistically analyse the natural texts in patent documents, which are descriptions of technologies at all levels. The patent database is vast and provides the most comprehensive coverage of technological domains. Specifically, our patent text corpus contains 781,156,082 tokens (words, bi-, tri- and four-grams) from 30,265,976 sentences of the titles and abstracts of 6,559,305 utility patents in the complete USPTO patent database from 1976 to 31st December 2019 (access date: 23 March 2020). Non-technical design patents are excluded. Technical description fields are avoided because they include information on contexts, backgrounds and prior art that may be repetitive and irrelevant to the specific invention, which can lead to statistical bias and increase computational requirements. We also avoided legal claim sections, which are written in repetitive, disguising and legal terms.
In general text analysis for topic modelling or information retrieval, various statistical metrics, such as term frequency (TF) [Tsz-WaiLo2005, Wilbur1992], inverse-document frequency (IDF) [Tsz-WaiLo2005], term-frequency-inverse-document-frequency (TFIDF) [Blanchard2007], entropy [Gerlach2019, Montemurro2010], information content [Gerlach2019], information gain [Makrehchi2008] and Kullback-Leibler (KL) divergence [Tsz-WaiLo2005], are employed to sort the words in a corpus [Gerlach2019, Makrehchi2008]. Herein we use TF, TFIDF and information entropy to automatically identify candidate stopwords.
Furthermore, some technically significant terms such as “composite wall”, “driving motion” and “hose adapter” are statistically indistinguishable from stopwords such as “be”, “and” and “for”, regardless of the statistical metric used for sorting. That is, automatic and data-driven methods by themselves are not accurate and reliable enough to return stopwords. Therefore, we also use a human-reliant step to further evaluate the automatically identified candidate stopwords and confirm a final set of stopwords which do not carry information on engineering and technology.
In brief, the overall procedure, as depicted in Figure 1, consists of three major steps: 1) basic pre-processing of the patent natural texts, including punctuation removal, lower-casing, phrase detection and lemmatization; 2) using multiple statistical metrics from NLP and information theory to identify a ranked list of candidate stopwords; 3) term-by-term evaluation by human experts of their insignificance for technical texts to confirm stopwords that are uninformative about engineering and technology. In the following, we describe the implementation details of these three steps.
The patent texts in the corpus are first transformed into a line-sentence format, utilizing the sentence tokenization method in the NLTK, and normalized to lowercase letters to avoid additional vocabulary caused by lowercase/uppercase differences of the same words. The punctuation marks in sentences are removed except “-” and “/”. These two special characters are frequently used in word-tuples, such as “AC/DC” and “inter-link”, which can be regarded as a single term. The original raw texts are transformed into a collection of 30,265,976 sentences, including 796,953,246 unigrams.
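The normalization described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `normalize_sentence` and the regular expression are assumptions, and sentence splitting (done with NLTK in the study) is assumed to have already happened.

```python
import re

# Assumed pattern: keep word characters, whitespace, "-" and "/";
# every other character is treated as punctuation.
_PUNCT = re.compile(r"[^\w\s\-/]")

def normalize_sentence(sentence: str) -> str:
    """Lower-case a sentence and strip punctuation except '-' and '/'."""
    lowered = sentence.lower()
    cleaned = _PUNCT.sub(" ", lowered)
    # Collapse the whitespace runs left behind by removed punctuation.
    return " ".join(cleaned.split())

print(normalize_sentence("An AC/DC converter, with an inter-link (optional)."))
# an ac/dc converter with an inter-link optional
```

Keeping “-” and “/” means that “ac/dc” and “inter-link” survive as single tokens when the sentence is later split on whitespace.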
Phrases are detected with the algorithm of Mikolov et al. [Mikolov2013], which finds words that frequently appear together and infrequently in other contexts. It uses a simple statistical method based on word counts to assign a score to each bigram:

$$\mathrm{score}(w_i, w_j) = \frac{\left(\mathrm{count}(w_i w_j) - \delta\right) \times |T|}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$

where $\mathrm{count}(w_i w_j)$ is the count of $w_i$ and $w_j$ appearing together as a bigram in the collection of sentences and $\mathrm{count}(w_i)$ is the count of $w_i$ in the collection of sentences. $\delta$ is the discounting coefficient that prevents too many phrases consisting of very infrequent words, and is set so that phrases occurring fewer than twice cannot have scores higher than 0. The term $|T| = \sum_{p} \sum_{t} n_{t,p}$ represents the total number of tokens in the patent database, where $n_{t,p}$ is the count of the term $t$ in the patent $p$. Bigrams with a score over a defined threshold are considered phrases and joined with a “_” character in the corpus, to be treated as a single term. We run the phrasing algorithm of Mikolov et al. [Mikolov2013] on the pre-processed corpus twice to detect n-grams with n from 2 to 4. The first run detects only bigrams by employing a higher threshold value, while the second run can detect n-grams up to n = 4 by using a lower threshold value to enable combinations of bigrams. By repeating the phrasing process with decreasing score thresholds, we detected phrases that appear more frequently in the first run using the higher threshold value, e.g., “autonomous vehicle”, and phrases that are comparatively less frequent in the second run using the lower threshold value, e.g., “autonomous vehicle platooning”. In this study, we used the best-performing thresholds (5, 2.5) found in a previous study [Sarica2020].
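The bigram scoring step can be sketched in a few lines. The score form used here, (pair count minus a discount) times the total token count, divided by the product of the two word counts, is an assumption reconstructed from the description above. The function name `bigram_scores` and the value `delta=1.0` are illustrative choices, not the authors' settings.

```python
from collections import Counter

def bigram_scores(sentences, delta=1.0):
    """Score adjacent word pairs; higher means more phrase-like.

    Assumed form:
        (count(a b) - delta) * total_tokens / (count(a) * count(b))
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_tokens = sum(unigrams.values())
    return {
        (a, b): (n_ab - delta) * total_tokens / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

toy = [
    ["autonomous", "vehicle", "control"],
    ["autonomous", "vehicle", "sensor"],
    ["vehicle", "control", "unit"],
]
scores = bigram_scores(toy)
print(scores[("autonomous", "vehicle")])  # 1.5
print(scores[("vehicle", "sensor")])      # 0.0
```

With `delta=1.0`, a pair seen only once scores 0 and can never pass a positive threshold. Pairs whose score exceeds the chosen threshold would then be joined with “_” before the second pass.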
The phrase detection computation resulted in a vocabulary of 15,435,308 terms, including 13,730,320 phrases. Since the adopted phrase detection algorithm is purely based on co-occurrence statistics, the detection of some faulty phrases that include stopwords such as “the_”, “a_”, “and_”, and “to_” is inevitable. Therefore, the detected phrases are processed one more time to split off the known stopwords from the NLTK [bird2009natural] and USPTO [USPTO] stopwords lists. For example, “an_internal_combustion_engine” is replaced with “an internal_combustion_engine”. The vocabulary is thereby reduced to 8,641,337 terms, including 6,900,263 phrases.
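A minimal sketch of this splitting step is given below. The helper name `split_stopwords` and the tiny stopword set are hypothetical. Splitting a phrase at a stopword that sits in the middle of it is also an assumption, since the text only shows a leading stopword being detached.

```python
def split_stopwords(phrase, stopwords):
    """Detach known stopwords from an underscore-joined phrase."""
    pieces, current = [], []
    for word in phrase.split("_"):
        if word in stopwords:
            # Close the phrase fragment collected so far, if any.
            if current:
                pieces.append("_".join(current))
                current = []
            pieces.append(word)
        else:
            current.append(word)
    if current:
        pieces.append("_".join(current))
    return " ".join(pieces)

STOPWORDS = {"a", "an", "the", "and", "to"}  # tiny illustrative subset
print(split_stopwords("an_internal_combustion_engine", STOPWORDS))
# an internal_combustion_engine
```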
Next, all the words are represented with their regularized forms to avoid having multiple terms representing the same word or phrase and thus decrease the vocabulary size. This step is achieved by first using a part-of-speech (POS) tagger [Toutanova2007] to detect the type of words in the sentences and lemmatize those words accordingly. For example, if the word “learning” is tagged as a VERB, it would be regularized as “learn” while it would be regularized as “learning” if it is tagged as a NOUN. The lemmatization procedure further decreased the vocabulary to 8,144,852 terms including 6,418,992 phrases.
As a last step, we removed the words contained in the widely used NLTK [bird2009natural] and USPTO [USPTO] stopwords lists. The NLTK stopwords list focuses on general stopwords encountered in everyday English, such as “a, an, the, …, he, she, his, her, …, what, which, who, …”, 179 words in total. The USPTO stopwords list, on the other hand, includes words that occur very frequently in patent documents and do not carry critical meaning within patent texts, such as “claim, comprise, … embodiment, … provide, respectively, therefore, thereby, thereof, thereto, …”, 99 words in total. The union of these two lists contains 220 stopwords.
Additionally, we discarded the words appearing only once in the whole patent database, which leads to a final set of 6,645,391 terms, including 5,834,072 phrases.
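These two filters can be sketched together as follows. The function name `prune_vocabulary`, the `min_count` parameter and the three-word stopword set are illustrative assumptions. In the study, the full NLTK and USPTO lists would take the place of the toy set.

```python
from collections import Counter

def prune_vocabulary(sentences, stopwords, min_count=2):
    """Drop listed stopwords and terms seen fewer than min_count times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [
        [t for t in sent if t not in stopwords and counts[t] >= min_count]
        for sent in sentences
    ]

sentences = [
    ["the", "hose_adapter", "comprise", "valve"],
    ["the", "hose_adapter", "seal"],
    ["valve", "thereof"],
]
stopwords = {"the", "comprise", "thereof"}  # illustrative subset only
pruned = prune_vocabulary(sentences, stopwords)
print(pruned)
# [['hose_adapter', 'valve'], ['hose_adapter'], ['valve']]
```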
3.2 Term Statistics
To identify the frequently occurring words or phrases that carry little information content about engineering and technology, we use four metrics together: 1) direct term frequency (TF), 2) inverse-document frequency (IDF), 3) term-frequency-inverse-document-frequency (TFIDF) and 4) Shannon’s information entropy [Shannon1948].
We use $TF_t$ to denote the direct frequency of term $t$. Consider a corpus $C$ of $N$ patents.

$$TF_t = \frac{n_t}{\sum_{p \in C} n_p}$$

where $n_p$ is the number of terms in the patent $p$ and $n_t$ is the total count of term $t$ in all patents. The term frequency is an important indicator of the commonality of a term within a collection of documents. Stopwords are expected to have high term frequency.
Inverse-document-frequency (IDF) is calculated as follows:

$$IDF_t = \log \frac{N}{d_t}$$

where $d_t$ is the number of patents containing term $t$ and $N$ represents the number of patents in the database. This metric penalizes frequently occurring terms and favours those occurring in only a few documents. The metric’s lower bound is 0, which corresponds to terms that appear in every single document in the database. The upper bound is defined by the terms appearing in only one document, which is $\log N$.
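As an illustration, both metrics can be computed on a toy corpus as below. The normalisation of term frequency by the total number of term occurrences in the corpus is an assumption drawn from the description above, and the function name `tf_and_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_and_idf(docs):
    """Return (tf, idf) dictionaries for a list of tokenised documents."""
    n_docs = len(docs)
    # Total occurrences of each term across the corpus.
    term_counts = Counter(tok for doc in docs for tok in doc)
    # Number of documents that contain each term at least once.
    doc_counts = Counter(tok for doc in docs for tok in set(doc))
    total = sum(term_counts.values())
    tf = {t: c / total for t, c in term_counts.items()}
    idf = {t: math.log(n_docs / d) for t, d in doc_counts.items()}
    return tf, idf

docs = [["method", "engine", "valve"], ["method", "rotor"], ["method", "engine"]]
tf, idf = tf_and_idf(docs)
print(idf["method"])  # 0.0, the term appears in every document
```

A term found in a single document, such as “rotor” here, reaches the upper bound `math.log(len(docs))`.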
Term frequency-inverse-document-frequency (TFIDF) is calculated as follows:

$$TFIDF_{t,p} = TF_{t,p} \times \frac{N}{d_t}$$

where $TF_{t,p}$ is the frequency of term $t$ within patent $p$, $d_t$ is the number of patents containing term $t$ and $N$ is the number of patents in the database. This metric favours terms that appear in few documents with a considerably high term frequency within the document. If a term appears in many documents, its TFIDF score is penalized by the IDF factor due to its commonality. Here, we did not use the traditional IDF metric but removed the log normalizing function, to penalize more heavily the terms that occur commonly in the entire patent database regardless of their in-document (patent) term frequencies. We eventually used the mean of the single-document TFIDF scores for each term.
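A hedged sketch of this log-free variant follows. Two details are assumptions, since the text does not pin them down: the in-document frequency is taken as the term count divided by the document length, and the mean is taken over the documents that contain the term. The function name `mean_tfidf` is hypothetical.

```python
from collections import Counter, defaultdict

def mean_tfidf(docs):
    """Mean per-document TFIDF, with IDF taken as N / d_t (no logarithm)."""
    n_docs = len(docs)
    doc_counts = Counter(tok for doc in docs for tok in set(doc))
    per_doc = defaultdict(list)
    for doc in docs:
        for term, c in Counter(doc).items():
            tf = c / len(doc)                 # assumed in-document frequency
            idf = n_docs / doc_counts[term]   # log deliberately omitted
            per_doc[term].append(tf * idf)
    # Assumed: average over the documents in which the term occurs.
    return {t: sum(v) / len(v) for t, v in per_doc.items()}

docs = [["method", "engine", "valve"], ["method", "rotor"], ["method", "engine"]]
scores = mean_tfidf(docs)
print(scores["rotor"])  # 1.5
```

In this toy corpus the ubiquitous “method” receives the lowest score, which is why stopword candidates are read from the low end of the TFIDF ranking.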
The entropy of term $t$ is calculated as follows. The metric indicates how evenly term $t$ is distributed over the corpus $C$.

$$H_t = -\sum_{p \in C} P(p \mid t) \, \log P(p \mid t)$$

where $P(p \mid t)$ is the distribution of term $t$ over patent documents. The maximum attainable entropy for a given collection of $N$ documents corresponds to an even distribution over all patents, which leads to $H_t = \log N$. Therefore, terms with higher entropy values contain less information about the patents in which they appear than terms with lower entropy.
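The entropy computation can be illustrated as below. The distribution of a term over documents is assumed here to be its count in each document divided by its total count in the corpus. That definition and the function name `term_entropy` are assumptions made for the sketch.

```python
import math
from collections import Counter

def term_entropy(docs):
    """Shannon entropy of each term's distribution over documents."""
    totals = Counter(tok for doc in docs for tok in doc)
    entropy = {t: 0.0 for t in totals}
    for doc in docs:
        for term, c in Counter(doc).items():
            p = c / totals[term]  # assumed share of the term's occurrences
            entropy[term] -= p * math.log(p)
    return entropy

docs = [["method", "engine"], ["method", "rotor"], ["method", "rotor"]]
H = term_entropy(docs)
print(H["engine"])  # 0.0, the term is confined to one document
```

Here “method” is spread evenly over all three documents and reaches the maximum `math.log(3)`, while “engine” occurs in a single document and scores 0.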
We report the distributions of terms in our corpus according to these four metrics in the Appendix (see Figure A1). The term-frequency distribution has a very long right tail, indicating that most terms appear only a few times in the patent database while some words appear very frequently. Our further tests found that the distribution follows a power law [Zipf1936, Zipf1949]. By contrast, the IDF distribution has a long left tail, indicating the existence of a few terms that appear commonly in all patents. The TFIDF distribution also has a long right tail, which indicates the existence of highly common terms in each patent and strongly domain-specific terms dominating a set of patents. Moreover, the long right tail of the entropy distribution indicates comparatively few high-valued terms that appear commonly across the entire database. Therefore, assessing the four metrics together allows us to detect stopwords with varied occurrence patterns.
3.3 Human Evaluation
We formed four lists of terms sorted by decreasing TF, increasing IDF, increasing TFIDF, and decreasing entropy. Table A1 in the appendix presents the 30 top-ranked terms in the respective lists. The top 2,000 terms in each of the four lists are then used to form a union set of terms. The union includes only 2,305 terms, which indicates that the lists based on the four alternative statistical metrics overlap significantly. The terms in the union set are then evaluated by two researchers, each with more than 20 years of engineering experience, on whether a term carries information about engineering and technology, to identify stopwords. The researchers initially achieved an inter-rater reliability of 0.83 [Cronbach1951] and then discussed the discrepancies to reach consensus on a final list of 62 insignificant terms.
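Forming the candidate set from the four rankings can be sketched as follows. The function name `candidate_union`, the data layout and all score values are made up for illustration. The study used the top 2,000 terms of each list rather than the top 2.

```python
def candidate_union(metrics, top_k):
    """Union of the top_k terms of each ranking.

    metrics maps a metric name to (scores, descending), where descending
    says whether stopword candidates sit at the high end of that metric.
    """
    union = set()
    for scores, descending in metrics.values():
        ranked = sorted(scores, key=scores.get, reverse=descending)
        union.update(ranked[:top_k])
    return union

# Hypothetical scores for four terms under the four metrics.
metrics = {
    "tf": ({"method": 0.4, "engine": 0.3, "rotor": 0.2, "valve": 0.1}, True),
    "idf": ({"method": 0.0, "engine": 0.4, "rotor": 1.1, "valve": 1.2}, False),
    "tfidf": ({"method": 0.44, "engine": 0.63, "rotor": 1.5, "valve": 0.5}, False),
    "entropy": ({"method": 1.1, "engine": 0.7, "rotor": 0.1, "valve": 0.0}, True),
}
candidates = candidate_union(metrics, top_k=2)
print(sorted(candidates))  # ['engine', 'method', 'valve']
```

The union holds three terms rather than the eight it could hold at most, which is the kind of overlap the 2,305 figure reflects at full scale.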
3.4 Final List
Compared with the stopword list from our previous study [Sarica2020] (see Table A2 in the Appendices), which was identified by manually reading 1,000 randomly selected sentences from the same patent text corpus, this list includes 26 new uninformative stopwords that the previous list did not cover. Meanwhile, we found that the previous list contains 25 other stopwords that are still deemed qualified stopwords in this study. Therefore, we integrate these 25 stopwords from the previous study with the 62 stopwords identified here to derive a final list of 87 stopwords for technical language analysis. The final list is presented in Table 1 together with the NLTK stopwords list and the USPTO stopwords list. The list can be downloaded from our GitHub repository: https://github.com/SerhadS/TechNet. We suggest applying the three stopwords lists together in technical language processing applications across technical fields.
4 Concluding Remarks
To develop a comprehensive list of stopwords in engineering and technology-related texts, we mined the patent text database with several statistical metrics, from term frequency to entropy, to automatically identify candidate stopwords, and used human evaluation to validate, screen and finalize stopwords from the candidates. In this procedure, automatic data-driven detection with the four statistical metrics yields highly overlapping results, and the human evaluations also showed high inter-rater reliability, suggesting evaluator independence. Our final stopwords list can be used as a complement to the NLTK and USPTO stopwords lists in NLP and text analysis tasks related to technology, engineering design, and innovation.