Identification, Tracking and Impact: Understanding the trade secret of catchphrases

Understanding the topical evolution in industrial innovation is a challenging problem. With the growth of digital repositories of patent documents, it is becoming increasingly feasible to understand the innovation secrets, or "catchphrases", of organizations. However, searching and understanding this enormous volume of text is a natural bottleneck. In this paper, we propose an unsupervised method for the extraction of catchphrases from the abstracts of patents granted by the U.S. Patent and Trademark Office over the years. Our proposed system achieves substantial improvement, both in terms of precision and recall, against state-of-the-art techniques. As a second objective, we conduct an extensive empirical study to understand the temporal evolution of the catchphrases across various organizations. We also show how the overall innovation evolution, in the form of the introduction of newer catchphrases in an organization's patents, correlates with the future citations received by the patents filed by that organization. Our code and data sets will be placed in the public domain soon.




1. Introduction

As software and other products become more complex, the number and size of patent documents keep increasing. Automated patent document processing systems are essential to extract information and gain insights from this ever-growing collection of patent databases. Catchphrases provide a concise representation of the content of a document. A catchphrase is a well-known word or phrase encapsulating a particular concept or subject of a document. Catchphrases capture the important legal and technical aspects of a document, instead of just summarizing it. They have numerous applications such as document categorization, clustering, summarization, indexing, topic search, quantifying semantic similarity with other documents, and conceptualizing the particular knowledge domain of the document (Gopavarapu et al., 2016; Jones and Paynter, 1999). However, since only a small minority of documents have author-assigned catchphrases, and manual assignment of catchphrases to existing documents is time-consuming, the automation of the catchphrase extraction process is highly desirable. In the current study, catchphrases represent innovation topics. Figure 1 presents example catchphrases from two different patent abstracts.

In this paper, we propose an unsupervised method for the extraction of catchphrases from the abstracts of patents granted by the U.S. Patent and Trademark Office over the years. The key contributions of this paper are as follows.

  • We propose an unsupervised technique for catchphrase identification and ranking in patent documents.

  • We conduct robust evaluations and comparison against several state-of-the-art baselines.

  • As a secondary objective, we study the evolution of catchphrases present in the patents filed by various organizations over time.

  • We bring forth some of the unique temporal characteristics of these catchphrases and show how these are correlated to the overall future citation count of the patents filed by an organization.

  • The catchphrase evolution study further unfolds that companies get polarized based on whether the patent documents keep re-using the same catchphrases over time or they introduce newer catchphrases as time progresses.

ID: US06681004
Abstract: The telephone memory aid provides a database to a primary party for storing and retrieving personal information about a secondary party, including summary information related to communication exchanges between the primary and secondary parties. The summary information includes, for example, the date and time of prior telephone calls and the topics discussed. This secondary party information, including the summaries of prior telephone calls, is available for review by the primary party during future phone calls with the secondary party. The telephone memory aid also facilitates entry of information into the database through speech recognition algorithms and through question and answer sessions with the primary and secondary parties.
ID: US06680003
Abstract: The present invention concerns chiral doping agents allowing a modification to be induced in the spiral pitch of a cholesteric liquid crystal, said doping agents including a biactivated chiral unit at least one of whose functions allows a chemical link to be established with an isomerisable group, for example by radiation, said group possibly having a polymerisable or co-polymerisable end chain. These new chiral doping agents find application in particular in a color display.

Figure 1. Example abstracts from USPTO patents US06681004 and US06680003. The highlighted set of words are identified as catchphrases from IPC (described in Section 5).

2. Related Work

A variety of techniques have been applied for automated keyword extraction, such as locating important phrases by analyzing markup like capitalization, section headings and emphasized text (Krulwich and Burkey, 1997); building a phrase dictionary by parts-of-speech (POS) tagging of word sequences (Larkey, 1999); thesaurus-based keyphrase indexing (Witten and Medelyan, 2006); domain-specific keyphrase extraction (Frank et al., 1999; Nguyen and Phan, 2009); and several other supervised methods such as KEA (Witten et al., 2005), MAUI (Medelyan et al., 2009), back-of-the-book indexing using catchphrase extraction (Csomai and Mihalcea, 2008), MAUI with text denoising (Shams and Mercer, 2012), CSSeer (Chen et al., 2013), etc. In recent years, artificial neural networks (ANNs) have been used to build predictive models that rank words in a document (Boger et al., 2001) and then select keywords based on these ranks.

It has been widely recognized that the innovative capability of a firm is a critical determinant of its performance and competitive edge (Bettis and Hitt, 1995; Helfat and Peteraf, 2003; Greenhalgh and Longland, 2005). Since patents are a direct outcome of the inventive process and are broken down by technical fields, they are considered indicators of not only the rate of the innovative activities of a firm but also its direction (Bloom and Van Reenen, 2002; Archibugi and Planta, 1996; Artz et al., 2010). Many previous studies have examined the relationships between the patenting activities of a company and its market value (Oh et al., 2012; Hall et al., 2005). Bornmann and Daniel (2008) review the citing behavior of scientists and show the role of citations as a reliable measure of impact. Cheng et al. (2009) show that some indicators of patent quality are statistically significant predictors of return on assets. Lee et al. (2012) assess future technological impacts by employing the future citation count as a proxy, while Lee et al. (2018) employ various patent indicators, such as novelty and scope, as features of an ANN for early identification of emerging technologies.

3. Datasets and Preprocessing

The current study requires a rich time-stamped dataset. We, therefore, leverage two independent data sources. These are:

  1. The patent dataset: We compile the first dataset by crawling the full-text patent articles available at the United States Patent and Trademark Office (USPTO). It comprises patents granted weekly (on Tuesdays) from January 1, 2003, to May 18, 2018 (excluding images/drawings). The patents are available as XML encoded files with English as the primary language. Out of all the curated documents, in this study, we only consider those patents for which the abstract information is present (see Table 1 for statistics).

  2. The newsgroup corpus: We also use another data source, the 20 Newsgroups Dataset, donated by T. Mitchell in 1999. It includes one thousand Usenet articles each from 20 newsgroups such as 'alt.atheism' and 'talk.politics.guns'. Approximately 4% of the articles are crossposted. This serves as a non-patent corpus to estimate the importance of a word specifically in the patent domain relative to a non-patent domain (see Table 1 for statistics).


Patent dataset

| Statistic | Value |
|---|---|
| Year range | 2003–2018 |
| Number of patents | 3,915,639 |
| Number of patents with abstract | 3,486,866 |

Newsgroup corpus

| Statistic | Value |
|---|---|
| Year range | 1993–2017 |
| Number of articles | 19,997 |
| Number of words | |
| Language | English |

Table 1. General statistics about the patent dataset and the newsgroup corpus. A large fraction (89%) of patents have abstract information.

Pre-processing: For both of the above, we performed several pre-processing tasks: conversion of sentences to lowercase, removal of special characters except apostrophes and periods, lemmatization, and removal of repeated white-spaces.
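The pre-processing steps above can be illustrated with a minimal sketch. It covers lowercasing, special-character removal (keeping apostrophes and periods) and white-space collapsing only; lemmatization is left out because it needs an external NLP library. The function name `preprocess` and the restriction to ASCII letters and digits are assumptions made for this illustration, not details taken from our pipeline.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase, drop special characters, and collapse white-space.

    Lemmatization is deliberately omitted in this sketch.
    """
    text = text.lower()
    # Keep ASCII letters, digits, white-space, apostrophes and periods;
    # every other character is replaced by a space (an assumption).
    text = re.sub(r"[^a-z0-9\s'.]", " ", text)
    # Collapse runs of white-space into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("The  Telephone-Memory AID, (v2)!"))
```

Replacing removed characters with a space, rather than deleting them, keeps hyphenated words from being fused into a single token.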

4. Catchphrase extraction

Catchphrase extraction is a challenging problem mainly due to the diversity and unavailability of large-text annotated datasets. We, therefore, present an unsupervised method for catchphrase extraction. We propose a two-stage extraction strategy that identifies relevant candidate catchphrases in a given patent article. In the first stage, we select the candidate catchphrases. This is followed by candidate catchphrase ranking in the second stage. Next, we describe the two stages in detail.

4.1. Stage-1: Candidate selection

In the first stage, we select candidate catchphrases from each patent's abstract. Empirically, we observe that all catchphrases are n-gram noun phrases, for example, unigrams (e.g. communication, dielectrometry, etc.), bigrams (e.g. consecutive bit, voice synthesizer, etc.), trigrams (e.g. integrated circuit device, hydrogen chloride gas, etc.) or quadrigrams (e.g. commercially available synthesis tool, electric signal processing board, etc.). We, therefore, perform part-of-speech (POS) tagging of each abstract text to identify noun phrases. Currently, we leverage Python's state-of-the-art NLP library SpaCy. Note that we experimented with two text processing approaches before noun phrase identification: (i) with stopwords (WS), and (ii) without stopwords (WOS). WS means that no stopwords were removed from the abstracts, whereas WOS means that all stopwords in the abstract text were removed beforehand. Abstracts with stopwords (WS) led to better quality extraction results due to the existence of stopwords in noun phrases. We discuss the results in detail in Section 5. Table 2 presents statistics of extracted candidate phrases from the dataset.
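As a small illustration of the candidate selection step, the sketch below assumes that noun phrases have already been extracted by a POS tagger and only shows how they could be bucketed by n-gram length. The helper `bucket_candidates` and the restriction to phrases of one to four words (suggested by the four n-gram classes discussed above) are assumptions for this example.

```python
from collections import Counter

NGRAM_NAMES = {1: "unigram", 2: "bigram", 3: "trigram", 4: "quadrigram"}


def bucket_candidates(noun_phrases):
    """Keep noun phrases of one to four words and count them by length."""
    kept = []
    counts = Counter()
    for phrase in noun_phrases:
        n = len(phrase.split())
        if n in NGRAM_NAMES:  # longer phrases are dropped (an assumption)
            kept.append(phrase)
            counts[NGRAM_NAMES[n]] += 1
    return kept, counts


phrases = [
    "communication",
    "voice synthesizer",
    "integrated circuit device",
    "electric signal processing board",
    "a much longer noun phrase example",
]
kept, counts = bucket_candidates(phrases)
print(counts)
```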

Word n-grams Count
Unigrams 208,105
Bigrams 2,616,762
Trigrams 4,432,251
Quadrigrams 2,138,696
Total 9,395,814
Table 2. Count of n-gram noun phrases generated from patent dataset.

4.2. Stage-2: Candidate ranking

Candidate phrases obtained in the first stage are ranked in this stage. The ranking algorithm is based upon two considerations: (i) how well the phrase describes the document's topic, and (ii) how specific the phrase is to the patent literature. Our proposed method unifies both by combining a frequency-based measure with an information-theoretic measure. Given a patent document $d$ and a set of candidate phrases obtained in the previous stage, we compute the phrase score $PS(c, d)$ for each phrase $c$:

$$PS(c, d) = \left[\sum_{i=1}^{n} \log\big(\mathrm{score}(t_i)\big)\right] \cdot \mathrm{KLI}(c, d) \qquad (1)$$

where $t_i$ denotes the $i$-th term in an $n$-gram candidate phrase $c$, $\mathrm{score}(t_i)$ denotes the score of the term $t_i$, estimating the importance of the term specifically in the patent domain relative to a non-patent domain, and $\mathrm{KLI}(c, d)$ represents the Kullback-Leibler divergence informativeness, specifying how well a candidate phrase $c$ represents a document $d$. The term $\mathrm{score}(t_i)$ in the above equation is computed as


Here, $P$ and $NP$ represent the patent collection and the non-patent (in our case, the newsgroup) collection, respectively. The importance of a term $t$ in a given collection $C$ is measured in terms of the collection frequency $cf(t, C)$ and the document frequency $df(t, C)$. $cf(t, C)$ represents how many times the term $t$ appeared in the entire collection $C$. $df(t, C)$ represents the count of documents in $C$ where the term $t$ appeared. It is computed as


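$\mathrm{KLI}(c, d)$ denotes an information-theoretic measure of how informative the phrase $c$ is in the given document $d$. It is computed as:

$$\mathrm{KLI}(c, d) = \frac{tf(c, d)}{N_d}\,\log\frac{tf(c, d)/N_d}{cf(c, P)/N_P} \qquad (4)$$

where $tf(c, d)$ represents how many times $c$ appeared in document $d$, $cf(c, P)$ denotes how many times $c$ appeared in the entire patent collection $P$, and $N_d$ and $N_P$ represent the total number of n-grams in $d$ and $P$, respectively.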

The above scoring method results in a ranking of candidate phrases. We select top-ranked candidates such as top-5, top-10, top-20, etc., and evaluate our unsupervised method in the next section.
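The ranking stage can be sketched as follows, under stated assumptions. The KL informativeness follows the standard form P(c|d) · log(P(c|d) / P(c|P)), with probabilities estimated from the n-gram counts described above. The phrase score is taken here as the sum of the logarithms of the per-term scores multiplied by the KLI value; the multiplication is one plausible reading of "combining" the two measures and should be treated as an assumption. Per-term scores are passed in as a precomputed dictionary (`term_scores`, a hypothetical input), so the sketch does not commit to a particular term-importance formula.

```python
import math


def kli(tf_cd, n_d, cf_cp, n_p):
    """KL informativeness: P(c|d) * log(P(c|d) / P(c|P)).

    All counts are assumed to be strictly positive.
    """
    p_doc = tf_cd / n_d
    p_coll = cf_cp / n_p
    return p_doc * math.log(p_doc / p_coll)


def phrase_score(phrase, term_scores, kli_value):
    """Sum of log term scores, multiplied by the KLI (assumed combination)."""
    log_sum = sum(math.log(term_scores[t]) for t in phrase.split())
    return log_sum * kli_value


def top_k(scored, k):
    """Return the k highest-scoring phrases from a {phrase: score} mapping."""
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Toy numbers, purely illustrative.
value = kli(tf_cd=2, n_d=10, cf_cp=1, n_p=100)
print(phrase_score("voice synthesizer", {"voice": 3.0, "synthesizer": 8.0}, value))
```

A phrase that is exactly as frequent in the document as in the whole collection gets a KLI of zero, so it contributes nothing regardless of its term scores.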

5. Experiments

In this section, we describe the experimental settings, baselines and the evaluation metrics. We construct a collection of possible catchphrases from the International Patent Classification (IPC) list. This list is maintained by the World Intellectual Property Organization (WIPO). The IPC provides a hierarchical system of language-independent symbols for the classification of patents and utility models according to the different areas of technology to which they pertain. The hierarchy comprises eight high-level categories:

  1. Cat-1: Human necessities

  2. Cat-2: Performing operations; Transporting

  3. Cat-3: Chemistry; Metallurgy

  4. Cat-4: Textiles; Paper

  5. Cat-5: Fixed constructions

  6. Cat-6: Mechanical engineering; Lighting; Heating; Weapons; Blasting

  7. Cat-7: Physics

  8. Cat-8: Electricity

In each of these high-level categories, several sub-categories exist. An n-gram phrase represents each category. We term these phrases ground truth catchphrases (GTC). Overall, we obtained 22,855 GTC such as "actuators", "cleaning fabrics", "feedback arrangements in control systems", etc. We use GTC to evaluate our proposed catchphrase extraction method. Table 3 presents examples of GTC for each high-level category. Next, we present four state-of-the-art baselines.

| Category | Unigrams | Bigrams | Trigrams | Quadgrams |
|---|---|---|---|---|
| Cat-1 | rhinoscopes | dental surgery | table service equipment | foodstuffs containing gelling agents |
| Cat-2 | thwarts | rivet hearths | making plough shares | making plastics bushes bearings |
| Cat-3 | riboflavin | septic tanks | acetone carboxylic acid | chromising of metallic material surfaces |
| Cat-4 | carding | carbon filaments | opening fiber bales | drying wet webs in paper-making |
| Cat-5 | collieries | suspension bridges | setting anchoring bolts | freezing for sinking mine shafts |
| Cat-6 | thermal | diesel engines | portable accumulator lamps | treating internal-combustion engine exhaust |
| Cat-7 | ozotypy | investigating abrasion | measuring electric supply | incineration of solid radioactive waste |
| Cat-8 | rheostats | electric accumulator | thermo magnetic devices | electric amplifiers for amplifying pulse |

Table 3. Examples of ground truth catchphrases for each high-level category available in the International Patent Classification (IPC) list.

5.1. Baselines

  1. Keyphrase extraction algorithm (KEA): KEA (Witten et al., 2005) is a supervised machine learning toolkit that extracts keyphrases and ranks them. The original algorithm was trained on scientific documents and uses a trained Naïve Bayes model. We trained KEA for patent documents leveraging a similar training procedure.

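  2. Legal: Mandal et al. (2017) also follow an unsupervised approach for identification of catchphrases from legal court cases. The scoring is done as:

    $$PS_{\mathrm{Legal}}(c, d) = \log\left(\sum_{i=1}^{n} \mathrm{score}(t_i)\right) \cdot \mathrm{KLI}(c, d) \qquad (5)$$

    where $t_i$ is the $i$-th term of the phrase $c$, and $\mathrm{score}(t_i)$ and $\mathrm{KLI}(c, d)$ can be calculated using equations 2 and 4 respectively. Note the change in the formula in equation 5 compared to equation 1: the logarithm is taken over the sum of the term scores rather than over each term score individually. This modification, as we shall see, almost doubles our performance.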

  3. KLIP: Tomokiyo and Hurst (2003) and Verberne et al. (2016) proposed a Kullback-Leibler (KL) divergence based phrase assignment score which is a linear combination of two different scores:

    1. KL informativeness (KLI): KLI measures how well a candidate phrase represents a document. It is computed using equation 4.

    2. KL phraseness (KLP): The KLP score is computed specifically for multi-word phrases. It compensates for the low frequency of multi-word phrases by assigning higher weights to longer phrases:

    $$\mathrm{KLP}(c, d) = P(c \mid d)\,\log\frac{P(c \mid d)}{\prod_{i=1}^{n} P(t_i \mid d)} \qquad (6)$$

    where $t_i$ is the $i$-th term of the phrase $c$, $P(c \mid d)$ is the relative frequency of $c$ in document $d$ as in equation 4, and $P(t_i \mid d)$ is estimated from $tf(t_i, d)$, the frequency of the term $t_i$ in document $d$.

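  4. BM25: BM25 (Robertson et al., 2009) is a well-known measure for scoring documents with respect to a given query. We use this function to assign a score to an extracted candidate phrase $c$ in a given document $d$. The scoring function is:

    $$\mathrm{BM25}(c, d) = \mathrm{IDF}(c) \cdot \frac{tf(c, d)\,(k_1 + 1)}{tf(c, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)} \qquad (7)$$

    where $tf(c, d)$ is the term frequency of phrase $c$ in the document $d$, $|d|$ is the length of $d$, and $\mathrm{avgdl}$ is the average document length in the collection. $k_1$ and $b$ are free parameters. We choose $k_1 \in [1.2, 2.0]$ and $b = 0.75$, as per previous literature (Robertson et al., 2009). $\mathrm{IDF}(c)$ is the inverse document frequency of the candidate phrase $c$, calculated as

    $$\mathrm{IDF}(c) = \log\frac{N - df(c) + 0.5}{df(c) + 0.5} \qquad (8)$$

    where $df(c)$ is the document frequency of the phrase $c$ in the collection and $N$ is the number of documents in the collection.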

Note that KEA is a supervised machine learning model whereas Legal, KLIP and BM25 are unsupervised methods.
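For concreteness, the BM25 baseline can be sketched with the standard Okapi BM25 form, applied to a phrase instead of a query term. The document-length normalization (`doc_len`, `avg_doc_len`) and the total document count `n_docs` belong to the standard formula; the function and parameter names are chosen for this illustration only.

```python
import math


def idf(n_docs, df):
    """Robertson-style inverse document frequency of a phrase."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def bm25(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """BM25 score of a candidate phrase in a single document."""
    # Length-normalized saturation term in the denominator.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(n_docs, df) * tf * (k1 + 1) / norm


# A phrase occurring 3 times in an average-length document,
# and present in 10 of 1,000 documents (toy numbers).
print(bm25(tf=3, doc_len=120, avg_doc_len=120, n_docs=1000, df=10))
```

Note that this form of IDF becomes negative for phrases that occur in more than half of the documents, which is one reason very frequent phrases rank poorly under this baseline.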

5.2. Evaluation measures

We evaluate our proposed method against the four baselines. We use two standard evaluation measures: (i) Macro precision, and (ii) Macro recall. These metrics are computed by macro-averaging the precision/recall values computed for every patent:

$$\text{Macro-}P = \frac{1}{|T|}\sum_{p \in T} P_p, \qquad \text{Macro-}R = \frac{1}{|T|}\sum_{p \in T} R_p$$

where $P_p$ and $R_p$ are the precision and recall values computed for patent $p$ in our test dataset $T$. The precision and recall values for the patent $p$ are computed as follows:

$$P_p = \frac{|D_p \cap G_p|}{|D_p|}, \qquad R_p = \frac{|D_p \cap G_p|}{|G_p|}$$

where $|D_p|$, $|D_p \cap G_p|$, and $|G_p|$ represent the number of catchphrases in the patent $p$ that are detected, detected and present in GTC, and present in GTC, respectively.
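The two evaluation measures can be sketched in a few lines. Here `detected` stands for the catchphrases extracted from one patent and `ground_truth` for the GTC phrases present in that patent; both names, and the convention of returning 0.0 for an empty set, are assumptions of this example.

```python
def precision_recall(detected, ground_truth):
    """Per-patent precision and recall of detected catchphrases."""
    detected, ground_truth = set(detected), set(ground_truth)
    hits = len(detected & ground_truth)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


def macro_average(per_patent):
    """Macro-average a list of (precision, recall) pairs over patents."""
    n = len(per_patent)
    macro_p = sum(p for p, _ in per_patent) / n
    macro_r = sum(r for _, r in per_patent) / n
    return macro_p, macro_r


pairs = [
    precision_recall(["actuators", "one side"], ["actuators"]),
    precision_recall(["septic tanks"], ["septic tanks", "riboflavin"]),
]
print(macro_average(pairs))
```

Macro-averaging gives every patent the same weight, regardless of how many catchphrases it contains.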

As KEA requires training, we partition our dataset into two splits: (i) train and (ii) test. The train split consists of 2,055,588 (65%) patent documents. The test split consists of 1,106,883 (35%) patent documents. For a fair comparison, we evaluate our proposed method against the baselines (described in Section 5.1) using only the test split.

5.3. Results and discussion

Table 4 compares our proposed catchphrase extraction approach against state-of-the-art baselines. We outperform all baselines by a substantial margin. The second-best system in terms of precision is KEA, whereas the second-best system in terms of recall alternates between KLIP and KEA depending on the rank cutoff. The baseline Legal performed worst among all the baselines, possibly because the authors take the logarithm of the sum of all the term scores rather than the sum of the logarithms of the scores. The former measure undermines the contribution of the score of each individual term and is therefore less effective and rather unintuitive.

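| Top-k | Precision: Our Model | Precision: KEA | Precision: Legal | Precision: BM25 | Precision: KLIP | Recall: Our Model | Recall: KEA | Recall: Legal | Recall: BM25 | Recall: KLIP |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.253 | 0.192 | 0.075 | 0.080 | 0.120 | 0.773 | 0.557 | 0.255 | 0.265 | 0.386 |
| 15 | 0.231 | 0.148 | 0.128 | 0.131 | 0.146 | 0.910 | 0.566 | 0.559 | 0.567 | 0.623 |
| 20 | 0.217 | 0.133 | 0.156 | 0.156 | 0.164 | 0.945 | 0.568 | 0.750 | 0.749 | 0.772 |
| 15% | 0.260 | 0.200 | 0.056 | 0.060 | 0.108 | 0.750 | 0.555 | 0.172 | 0.185 | 0.323 |
| 20% | 0.240 | 0.156 | 0.109 | 0.111 | 0.132 | 0.886 | 0.563 | 0.448 | 0.457 | 0.528 |

Table 4. Comparison of our proposed method against the baselines: Precision and recall values at different top-ranks ($k$) of extracted catchphrases.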

6. Temporal study

In this section, we intend to show the usability of catchphrase extraction. We claim that catchphrase evolution presents a fair understanding of the changing innovation trends of companies. We conduct several temporal studies to understand the emergence of new research topics in the industry. In this study, we select the top-10 companies from three industrial segments: (1) Software, (2) Hardware, and (3) Mobile Phones. Table 5 presents the list of top-10 companies in each of the above three segments.

| Software | Hardware | Mobile Phone |
|---|---|---|
| Microsoft | Apple | Samsung |
| Google | Samsung | Apple |
| IBM | IBM | Huawei |
| Oracle | Foxconn | Oppo |
| Facebook | Hewlett Packard | Vivo |
| Tencent | Lenovo | Xiaomi |
| SAP | Fujitsu | OnePlus |
| Accenture | Quanta Computer | Lenovo |
| TCS | AsusTek | Nokia |
| Baidu | Compal | LG |

Table 5. Top-10 Software, Hardware, and Mobile Phone companies selected from three publicly available lists.

In subsequent sections, we analyze patents filed by these companies over the years. In our patent dataset, each company can have several variations in name due to multiple research groups, geographical locations, subsidiaries, headquarters, etc. For example, IBM is present as 'International Business Machines Corporation Armonk', 'International Business Machines Laboratory Inc.', etc. We overcome these inconsistencies by manually annotating name variations. However, we believe that basic string matching techniques could automate this normalization. Besides, we eliminate frequently occurring catchphrases like 'method', 'present invention', etc., to ignore noisy/redundant signals. This filtering process was automated by removing the catchphrases with the top-10 document frequencies. Next, we present how catchphrases can be leveraged in understanding the topical evolution of companies.

6.1. Topic evolution

In this section, we study the topical evolution of companies. We leverage the Jaccard Similarity (JS) between the catchphrases to compute the topical overlap between patents filed in consecutive years by a specific company. We conduct this experiment for the 11 years from 2006 to 2016. Figure 2 shows temporal profiles of a three-year moving average over JS for each of the three segments. We observe that Baidu in the Software segment and Oppo, Vivo, and OnePlus in the Mobile Phone segment exhibit relatively low similarity between catchphrases over the years. However, most of the companies have similarity curves with multiple peaks and an overall increase in the JS values over the years. For this analysis, we only considered 2-gram catchphrases. However, we found similar observations for higher n-gram catchphrases. If an organization keeps filing patents on the same topics over the years, the JS value is expected to increase; on the other hand, if an organization is continuously filing patents on newer topics, the JS value is expected to decline.
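A minimal sketch of the similarity profile computation is given below. It assumes that catchphrases are available as one set per year for a company (`phrases_by_year`, a hypothetical input) and that the moving average uses a trailing window of three values; the exact windowing used for Figure 2 may differ.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of catchphrases."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_profile(phrases_by_year):
    """JS between each pair of consecutive years, in chronological order."""
    years = sorted(phrases_by_year)
    return [
        jaccard(phrases_by_year[y0], phrases_by_year[y1])
        for y0, y1 in zip(years, years[1:])
    ]


def moving_average(values, window=3):
    """Trailing moving average; returns len(values) - window + 1 points."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


toy = {
    2006: {"user interface", "data store"},
    2007: {"user interface", "search engine"},
    2008: {"user interface", "search engine"},
}
print(similarity_profile(toy))
```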

Figure 2. Moving average of catchphrases similarity between consecutive years for – Software (left), Hardware (center), and Mobile Phone (right) companies.

6.2. Categorization

Further, we conduct a nuanced study to understand this temporal behavior. We classify each company's similarity profile into six categories (Chakraborty et al., 2015) based on the number and location of peaks. A peak in the similarity profile of a company represents a high topical similarity between consecutive years followed by a period of topical drift. We leverage the peak identification method proposed by Chakraborty et al. (2015). Note that peaks occurring in consecutive years are considered a single peak. The categories are:

  1. MonInc: Similarity profile that monotonically increases. The peak occurs in the last year.

  2. MonDec: Similarity profile that monotonically decreases. The peak occurs in the first year.

  3. PeakInit: Similarity profile that consists of a single peak within the first three years but not in the first year.

  4. PeakLate: Similarity profile that consists of a single peak after the initial three years but not in the last year.

  5. PeakMult: Similarity profile consisting of multiple peaks.

  6. Others: Similarity profiles that do not qualify into the above categories are kept in this category. They mainly consist of profiles with extremely low JS values for each year.
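The category rules listed above can be sketched as a small decision procedure. The sketch assumes that peak positions (0-based indices into the profile) have already been identified by a separate peak detection step, which is not reproduced here. Representing a run of consecutive peaks by its first index, and treating a completely flat profile as Others, are assumptions of this illustration.

```python
def merge_consecutive(peaks):
    """Treat peaks in consecutive years as one peak (keep the first index)."""
    merged, prev = [], None
    for i in sorted(peaks):
        if prev is None or i != prev + 1:
            merged.append(i)
        prev = i
    return merged


def categorize(profile, peaks):
    """Assign a similarity profile to one of the six categories."""
    n = len(profile)
    steps = list(zip(profile, profile[1:]))
    rising = all(x <= y for x, y in steps)
    falling = all(x >= y for x, y in steps)
    if rising and not falling:
        return "MonInc"
    if falling and not rising:
        return "MonDec"
    peaks = merge_consecutive(peaks)
    if len(peaks) > 1:
        return "PeakMult"
    if len(peaks) == 1:
        p = peaks[0]
        if 0 < p < 3:  # within the first three years, not the first
            return "PeakInit"
        if 3 <= p < n - 1:  # after the first three years, not the last
            return "PeakLate"
    return "Others"


print(categorize([0.1, 0.1, 0.2, 0.3, 0.6, 0.2], peaks=[4]))
```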

Table 6 shows the categorization results. We find no company in the MonDec and PeakInit categories. The majority of the companies are present in the PeakMult category, followed by the PeakLate category. Companies in the Others category have filed very few patents. Three out of four companies in the Others category are recently launched mobile companies.

| Category | Count | Names |
|---|---|---|
| MonInc | 4 | Tencent, Samsung, Xiaomi, Lenovo |
| MonDec | 0 | |
| PeakInit | 0 | |
| PeakLate | 6 | Facebook, TCS, Huawei, AsusTek, Foxconn, Compal |
| PeakMult | 12 | HP, SAP, Accenture, Nokia, Fujitsu, Quanta Computer, Microsoft, IBM, Oracle, Google, Apple, LG |
| Others | 4 | Baidu, Oppo, Vivo, OnePlus |

Table 6. Categorization of top-10 Software, Hardware, and Mobile Phone companies based on temporal catchphrase similarity profile. No company was classified in MonDec and PeakInit category.

Even though every profile in the PeakMult category consists of multiple peaks, we observe two distinct fluctuation patterns. We term these patterns (i) stable and (ii) unstable. In the stable pattern, the profile fluctuates considerably less, whereas in the unstable pattern the profile fluctuates strongly. We quantify these patterns by leveraging the average value of JS. Given that $JS(c)$ is the similarity profile for a company $c$, the average value of JS, $\overline{JS}(c)$, is computed as:

$$\overline{JS}(c) = \frac{1}{n}\sum_{i=1}^{n} JS_i(c)$$

where $JS_i(c)$ is the $i$-th value of the profile and $n$ is the number of values in the profile. Empirically, we observe that companies whose $\overline{JS}$ exceeds a threshold can be classified as unstable, while the rest can be classified as stable; in Table 7, this threshold falls between 0.105 and 0.117. Table 7 shows the companies in the PeakMult category that are further categorized into stable and unstable. Of the two sub-categories, stable contains more companies (7) than unstable (5).

| Company | Average JS | Category |
|---|---|---|
| Nokia | 0.040 | stable |
| Fujitsu | 0.085 | stable |
| Quanta Computer | 0.069 | stable |
| Microsoft | 0.105 | stable |
| Accenture | 0.040 | stable |
| SAP | 0.048 | stable |
| Hewlett Packard | 0.084 | stable |
| LG | 0.223 | unstable |
| Oracle | 0.117 | unstable |
| Google | 0.121 | unstable |
| Apple | 0.197 | unstable |
| IBM | 0.124 | unstable |

Table 7. List of companies in PeakMult that are classified into stable and unstable sub-categories along with the average value of Jaccard Similarity used for categorization.
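The split in Table 7 can be reproduced with a short sketch. The average JS values are copied from Table 7; the cutoff of 0.11 is a hypothetical value chosen only because it separates the two groups in that table (any value between 0.105 and 0.117 would do), and it should not be read as the threshold used in our experiments.

```python
def average_js(profile):
    """Mean of a company's Jaccard similarity profile."""
    return sum(profile) / len(profile)


def stability(avg_js, threshold=0.11):
    """Label a PeakMult company; the threshold is a hypothetical cutoff."""
    return "unstable" if avg_js > threshold else "stable"


# Average JS values as listed in Table 7.
TABLE_7 = {
    "Nokia": 0.040, "Fujitsu": 0.085, "Quanta Computer": 0.069,
    "Microsoft": 0.105, "Accenture": 0.040, "SAP": 0.048,
    "Hewlett Packard": 0.084, "LG": 0.223, "Oracle": 0.117,
    "Google": 0.121, "Apple": 0.197, "IBM": 0.124,
}

labels = {name: stability(value) for name, value in TABLE_7.items()}
print(sorted(name for name, label in labels.items() if label == "unstable"))
```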

6.3. Citation count

Citations, in the scholarly world, determine the popularity of research papers/authors/organizations. Here, we adopt a similar analogy for patent articles. A patent citation is a document cited by an applicant, third party, or a patent office examiner because its content relates to a patent application. We compute the citation count of a patent $p$ by summing the citations received by $p$. For the current study, we construct citer-cited pairs by extracting references present in patent texts and use these pairs to compute patent citation counts.

Next, we create multiple citation zones based on the citation count of a patent. We define four distinctive zones: (i) very low, (ii) low, (iii) medium, and (iv) high, to study the influence of the JS profile of a company on the number of citations received by its patents. Table 8 presents zoning statistics of the complete dataset. Out of 3,829,153 patent articles, 1,499,175 have zero citation count.

| Category | Citation count (x) | Patent count |
|---|---|---|
| Very Low | x = 0 | 1,499,175 |
| Low | 0 < x ≤ 5 | 1,274,029 |
| Medium | 5 < x ≤ 25 | 840,461 |
| High | x > 25 | 215,488 |

Table 8. Patent citation zones with distinct citation count ranges.
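The citation counting and zoning steps can be sketched as follows. Citation counts are derived from citer-cited pairs as described above. The zone boundaries follow the ranges in Table 8, but whether each boundary value (5 and 25) belongs to the lower or the upper zone is an assumption here, as are the placeholder patent identifiers in the example.

```python
from collections import Counter


def citation_counts(citer_cited_pairs):
    """Count how many times each patent is cited."""
    return Counter(cited for _citer, cited in citer_cited_pairs)


def citation_zone(count, low_max=5, medium_max=25):
    """Map a citation count to a zone (boundary inclusivity is assumed)."""
    if count == 0:
        return "Very Low"
    if count <= low_max:
        return "Low"
    if count <= medium_max:
        return "Medium"
    return "High"


# Placeholder identifiers, purely illustrative.
pairs = [("A", "B"), ("C", "B"), ("A", "D")]
counts = citation_counts(pairs)
print({patent: citation_zone(counts[patent]) for patent in "ABCD"})
```

Because `Counter` returns 0 for missing keys, patents that never appear as a cited document fall into the Very Low zone without special handling.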

Next, we relate similarity profiles and citation count zones. For each company, we measure the fraction of patents in different citation zones. We leverage histograms as a visualization tool to conduct this study. In Figure 3, we observe that the fraction of patents in Medium and High citation zones in PeakLate category are relatively higher than in MonInc category. This indicates that the introduction of diversity in topics over time helps in enhancing the future citations of the patents filed by a company.

Figure 3. Citation count zones vs similarity profiles: Fraction of patents in PeakLate (left) and MonInc (right) category companies in each citation count zone.

Figure 4 compares the two sub-categories of PeakMult. We observe that the fraction of patents falling in the Medium and High citation zones is relatively higher in the unstable category than in the stable category, implying that companies with high fluctuations in their similarity profiles receive more citations. A possible explanation is that companies with a relatively specialized research domain file patents that attract fewer citations than those of companies with a diversified research domain.

Figure 4. Citation count zones vs similarity profiles: Fraction of patents in stable (left) and unstable (right) category companies in each citation count zone.

Lastly, we study Others category in Figure 5. Quite surprisingly, we observe that the fraction of patents in Medium and High citation zones in Others category is relatively higher than the rest of the categories described above in Figures 3 and 4.

Figure 5. Citation count zones vs similarity profiles: Fraction of patents in Others in each citation count zone.
Figure 6. Word clouds of the representative companies in different similarity profile based categories at three distinct years. (a) stable (Microsoft), (b) unstable (Oracle), (c) peaklate (Facebook), and (d) moninc (Samsung).

6.4. Catchphrases in the stable and unstable groups

In this section, we analyze the extent of usage of certain catchphrases (bigrams and trigrams) by a company. We rank the catchphrases based on document frequency, i.e., the number of patent documents in which a catchphrase is present. Tables 9 and 10 show the top-10 bigrams for companies present in the stable and unstable groups, respectively. Tables 11 and 12 show the top-10 trigrams for the same companies. Lastly, Table 13 lists the top-10 bigrams and trigrams from the entire stable and unstable categories taken together. While the stable group is concerned more with computer systems, the unstable group is concerned more with electronic device parts.

| HP | MS | SAP | Accenture | Nokia | Fujitsu | Quanta Computer |
|---|---|---|---|---|---|---|
| print job | client device | business process | processing device | user interface | closed position | circuit board |
| one aspect | search result | application server | third party | communication device | inner surface | display panel |
| printing system | application program | application program | real-world environment | one embodiment | opposite side | second image |
| second set | user input | software application | mobile device | computer program | upper surface | one side |
| operating system | computing system | business application | invention concern | telecommunication system | longitudinal axis | second end |
| second side | search engine | system method | educational material | telecommunication network | another embodiment | battery module |
| second position | data store | data structure | computer-implemented method | access point | opposite end | second position |
| second portion | least portion | system software | communication network | data transmission | open position | one end |
| display device | subject matter | user input | synchronized video | least part | bottom surface | portable computer |
| present disclosure | client computer | business object | solution information | second device | side wall | power supply |

Table 9. Bi-grams with the top 10 document frequency values in STABLE category.
| IBM | Oracle | Google | Apple | LG |
|---|---|---|---|---|
| top surface | application server | user interface | integrated circuit | second electrode |
| operating system | operating system | present disclosure | second set | one side |
| storage device | data structure | one example | user input | common electrode |
| computer program | second set | system method | one example | light source |
| second set | software application | example method | first set | lcd device |
| computing system | one technique | content item | operating system | control information |
| another embodiment | source code | user input | another embodiment | display device |
| user interface | computer-implemented method | one processor | least portion | array substrate |
| drain region | another aspect | user device | host device | drain electrode |
| data structure | database object | subject matter | client device | washing machine |

Table 10. Bi-grams with the top 10 document frequency values in UNSTABLE category.

| HP | MS | SAP | Accenture | Nokia | Fujitsu | Quanta Computer |
|---|---|---|---|---|---|---|
| storage area network | host operating system | first data object | dual information system | first base station | user’s head | portable electronic apparatus |
| least one component | client computing device | one general aspect | telecommunication industry taxonomy | packet data network | first second portion | mobile communication device |
| fluid ejection assembly | mobile communication device | business process model | contact center representative | least one parameter | user’s foot | second frequency band |
| first second set | user’s interaction | second user input | contact center system | first network element | least one opening | third conductor arm |
| least one component | least one implementation | least one service | context-appropriate enforcing completion | user equipment due | thinning spraying irrigation | portable computer system |
| disclosed embodiment relate | distributed computing system | core software platform | location-based service system | wireless communication device | patient’s body | second radiating element |
| least one surface | application program interface | least one attribute | cognitive educational experience | least one cell | least one side | blade server system |
| inkjet ink composition | one computing device | second data object | individualized learning experience | wireless communication system | least one aperture | service agent server |
| central processing unit | client computer system | one exemplary embodiment | user’s comprehension | wireless communication device | storied index rating | printed circuit board |
| graphical user interface | wireless access point | related method system | object recognition analysis | second base station | usda hardiness zone | wireless communication device |

Table 11. Tri-grams with the top 10 document frequency values in the STABLE category.

| IBM | Oracle | Google | Apple | LG |
|---|---|---|---|---|
| first conductivity type | current result list | one search result | electronic device housing | light guide plate |
| field effect transistor | flexible extensible architecture | first search result | scrolling 3d manipulation | digital broadcasting system |
| second dielectric layer | computer program product | image sensor interface | intuitive hand configuration | second semiconductor layer |
| gate dielectric layer | distributed computing environment | disclosed subject matter | hand approach touch | liquid crystal cell |
| data communication network | graphical user interface | image search result | proximity-sensing multi-touch surface | light emitting diode |
| integrated circuit device | data storage system | client computing device | wireless communication circuitry | main service data |
| direct physical contact | data processing system | client computing device | antenna resonating element | image display device |
| second conductivity type | application programming interface | second computing device | computer readable medium | serving base station |
| database management system | database management system | mobile communication device | wireless communication system | light emitting diode |
| buried insulator layer | contention management mechanism | distributed storage system | wireless electronic device | first second electrode |

Table 12. Tri-grams with the top 10 document frequency values in the UNSTABLE category.

| Bi-grams (stable) | Tri-grams (stable) | Bi-grams (unstable) | Tri-grams (unstable) |
|---|---|---|---|
| closed position | first second portion | top surface | second semiconductor layer |
| another embodiment | user’s head | user interface | printed circuit board |
| opposite side | least one opening | second set | light guide plate |
| inner surface | least one side | operating system | light emitting diode |
| upper surface | user’s foot | present disclosure | digital broadcasting system |
| one aspect | least one aperture | least portion | first semiconductor layer |
| longitudinal axis | patient’s body | another embodiment | light emitting diode |
| open position | thinning spraying irrigation | system method | liquid crystal cell |
| second position | central processing unit | data structure | first second electrode |
| opposite end | storie index rating | computing system | serving base station |

Table 13. Bi-grams and Tri-grams with the top 10 document frequency values.
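
The document-frequency ranking behind the tables in this section can be illustrated with a minimal sketch. It assumes each patent abstract has already been reduced to a list of extracted catchphrases; the function name `top_k_by_document_frequency`, its parameters and the toy `patents` data are hypothetical and are not taken from our pipeline.

```python
from collections import Counter


def top_k_by_document_frequency(docs_phrases, n, k=10):
    """Rank n-gram catchphrases by the number of documents containing them.

    docs_phrases: iterable of phrase lists, one list per patent (assumed input).
    n: phrase length in whitespace-separated tokens (2 = bigram, 3 = trigram).
    k: number of top-ranked phrases to return.
    """
    df = Counter()
    for phrases in docs_phrases:
        # Use a set so a phrase repeated inside one patent is counted once.
        df.update({p for p in phrases if len(p.split()) == n})
    # Sort by descending document frequency; break ties alphabetically
    # so the ranking is deterministic.
    return sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical toy data: three patents from one company.
patents = [
    ["user interface", "operating system", "user interface",
     "graphical user interface"],
    ["operating system", "data structure", "graphical user interface"],
    ["operating system", "user interface", "light emitting diode"],
]

top_bigrams = top_k_by_document_frequency(patents, n=2, k=2)
top_trigrams = top_k_by_document_frequency(patents, n=3, k=1)
print(top_bigrams)
print(top_trigrams)
```

Running the same function over all patents of a category, rather than of a single company, would give an aggregate ranking of the kind reported in Table 13.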

6.5. Temporal visualizations

In this section, we study the catchphrase evolution of companies. We use word clouds, a popular visualization tool, and create them for each company for the years 2003–2016. Due to space constraints, Figure 6 shows word clouds for only one representative company from each of the stable, unstable, PeakLate and MonInc categories at three representative years. We claim that catchphrase evolution presents a fair understanding of the changing innovation trends of companies. Note that we consider only bigram catchphrases in this study. We can conduct a similar study for any company in different years. Footnote 9: The detailed word clouds for all companies in our dataset are available at

In Figure 6a, we study catchphrase evolution for Microsoft (a representative company in the stable group). We observe a shift from traditional topics such as client-server models, databases and basic Web development (in 2003) toward full-fledged Web search and Internet technologies (in 2010). In 2016, the focus shifted to mobile devices and gesture identification. These trends coincide with several product releases such as Bing (a search engine released in 2009) and Lumia (mobile phones released in 2015).

In Figure 6b, we study catchphrase evolution for Oracle (a representative company in the unstable category). Oracle seems to have shifted its focus from traditional database topics such as relational databases and queries (in 2003) toward the development of software as a service (SaaS) in 2010. In 2016, it continued to focus on services, with a major emphasis on reliable authentication mechanisms in the cloud. These innovation trends are reflected in several products such as Oracle Cloud (a cloud computing service launched in 2016) and Primavera (an enterprise project portfolio management software acquired by Oracle in 2008).

Similarly, in Figure 6c, we study catchphrase evolution for Facebook (a representative company in the PeakLate category). As Facebook started its operations in 2004, we present visualizations for three years: 2010, 2013 and 2016. The initial focus was on developing technical features such as news feeds and membership. In 2013, the trends shift toward instant messaging. In 2016, the catchphrases show a distinct innovation pattern of restricting and disclosing data availability. Facebook Messenger (introduced in 2011) is one of the products developed between 2011 and 2013.

We study Samsung as a representative company in the MonInc category (see Figure 6d). Samsung’s primary focus lies in traditional electronics innovation. Recent trends suggest an increased focus on mobile technologies such as user interfaces and display units.
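
The per-year inputs to such word clouds, and the phrases that newly appear between two years, can be sketched as follows. This is an illustrative sketch only: it assumes each patent is available as a (year, catchphrases) pair for one company, weights each bigram by its document frequency in that year, and uses hypothetical names (`yearly_phrase_weights`, `newly_introduced`) and hypothetical toy records.

```python
from collections import Counter, defaultdict


def yearly_phrase_weights(records, n=2):
    """Map each year to {phrase: document frequency} for n-gram phrases.

    records: iterable of (year, phrases) pairs, one per patent (assumed input).
    n: phrase length in tokens; 2 matches the bigram-only setting used here.
    """
    weights = defaultdict(Counter)
    for year, phrases in records:
        # Count each phrase at most once per patent.
        for phrase in set(phrases):
            if len(phrase.split()) == n:
                weights[year][phrase] += 1
    return {year: dict(counts) for year, counts in weights.items()}


def newly_introduced(weights, earlier, later):
    """Phrases present in the later year but absent in the earlier year."""
    before = set(weights.get(earlier, {}))
    after = set(weights.get(later, {}))
    return sorted(after - before)


# Hypothetical toy records for a single company.
records = [
    (2003, ["client server", "data store"]),
    (2003, ["data store", "relational database model"]),
    (2010, ["search engine", "data store"]),
    (2010, ["search engine", "search result"]),
]

weights = yearly_phrase_weights(records)
print(weights[2003])
print(newly_introduced(weights, 2003, 2010))
```

Each per-year dictionary can be handed to any word-cloud renderer that accepts a phrase-to-weight mapping, and the list of newly introduced phrases gives a textual counterpart to the shifts read off the clouds.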

7. Conclusion and future work

In this paper, we propose an unsupervised catchphrase identification and ranking system. Our proposed system achieves a substantial improvement, in terms of both precision and recall, over state-of-the-art techniques. We demonstrate the usefulness of this extraction by analyzing how topics evolve in patent documents and how these evolution patterns correlate with the future citation count of the patents filed by a company.

In the future, we plan to extend the current work by developing an online interface for automatic catchphrase identification. We also plan to understand the influence of catchphrase evolution on the company’s revenue.

