Patents disclose what their creators consider valuable inventions - so valuable, in fact, that they spend a nontrivial amount of time and money on protecting them legally. Not only do patents define the extent of the legal protection, but they also describe in detail the invention and its embodiments, its relation to prior art, and contain metadata. It is common wisdom among patent professionals that up to 80% of the information in patents cannot be found elsewhere (Asche, 2017).
As a result, patents have been widely studied, with various aims. Recently, Natural Language Processing (NLP) approaches - which aim at automatically analyzing text - are emerging. This survey explores the application of NLP techniques to patent summarization, simplification, and generation. There are several reasons why we focus on these tasks: first of all, they have been explored less when compared, for example, to Patent Retrieval (Lupu and Hanbury, 2013; Shalaby and Zadrozny, 2019) and automatic patent classification (Gomez and Moens, 2014). However, their practical importance is hard to overstate: the volume of patent applications is enormous (according to the World Intellectual Property Organization, WIPO, 3.2 million patents were filed in 2019), and keeping pace with the technology is becoming difficult. One of the patents’ aims is to make knowledge circulate and accelerate the transfer of technology: however, this is hardly achievable given the information overload. Patent agents, R&D groups, and professionals would thus highly benefit from tools that digest information from patents or make them easier to process, given their length and complexity. The other reason is more technical and arises from patents’ peculiar linguistic characteristics. Patents are primarily textual documents, and they have proved an interesting testbed for NLP researchers. Interesting yet challenging: being a mixture of legal and technical terms, patents’ language differs markedly from the general discourse.
Our contributions are the following: we present an analysis of patents’ linguistic characteristics and focus on the idiosyncrasies that negatively affect the use of off-the-shelf NLP tools (Section 2); after defining the patent summarization, simplification and generation tasks (Section 3) we describe the few available datasets and the evaluation approaches (Sections 4 and 5). Next, we review previous work in Sections 6, 7, and 8. Our review is rather comprehensive and covers works from the early 2000s to date. We pay special attention to the algorithms and models used from an NLP perspective. To the best of our knowledge, this is the first work that surveys summarization, simplification and generation techniques specifically in the patent domain. Note, however, that since patent processing has historically been application-oriented, previous work often used project-specific datasets, making it difficult to compare approaches directly in terms of performance. Finally, we present interesting lines of investigation for future research.
2 A primer on patents
Patents are primarily legal documents. Their owner controls the use of an invention for a limited time in a given geographic area and thus excludes others from making, using, or selling it without previous authorization. In exchange, the inventor discloses the invention to facilitate the transfer of technology.
This section defines some domain-specific concepts that we will reference in the following; we use patent US4575330A (https://patents.google.com/patent/US4575330A/en, last accessed March 2021), the antecedent of a 3D printer, designed by Hull in 1989, as a running example.
2.1 Patent documents
Patent documents are highly structured and must follow strict rules (see the WIPO Patent Drafting Manual, 2007, https://www.wipo.int/publications/en/details.jsp?id=297). Typically, they contain the following sections:
- Title
E.g., Apparatus for production of three-dimensional objects by stereolithography
- Claim
Specifies the extent of legal protection. This section can include multiple claims with a hierarchical structure. (We will refer to the whole document section using the cased form Claim, while the individual claims contained in such section will be lowercase.)
A system for producing a three-dimensional object from a fluid medium capable of solidification when subjected to prescribed synergistic stimulation, said system comprising: means for drawing upon and forming successive cross-sectional laminae of said object at a two-dimensional interface; and means for moving said cross-sections as they are formed and building up said object in step wise fashion, whereby a three-dimensional object is extracted from a substantially two-dimensional surface.
An improved system for producing a three-dimensional object from a fluid medium capable of solidification when subjected to prescribed synergistic stimulation, said system comprising: […]
A system as set forth in claim 2, and further including: programmed control means for varying the graphic pattern of said reaction means operating upon said designated surface of said fluid medium.
Claims 1 and 2 are independent, while claim 3 is dependent on claim 2, which it further specifies. The document comprises 47 claims, which this paper is too small to contain. Following patent rules, each claim consists of a single sentence and is therefore long, complex, and highly punctuated. The language is abstract to obfuscate the invention’s limitations and full of legal jargon.
- Description
A description detailed enough for a person skilled in the art to make and understand the invention. (A “person skilled in the art” has ordinary skills in the invention technical field. For a formal definition, refer to the PCT International Search and Preliminary Examination Guidelines.)
Briefly, and in general terms, the present invention provides a new and improved system for generating a three-dimensional object by forming successive, adjacent, cross-sectional laminae of that object at the surface of a fluid medium capable of altering its physical state in response to appropriate synergistic stimulation, the successive laminae being automatically integrated as they are formed to define the desired three-dimensional object.
In a presently preferred embodiment, by way of example and not necessarily by way of limitation, the present invention harnesses the principles of computer generated graphics in combination with stereolithography, i.e., the application of lithographic techniques to the production of three dimensional objects, to simultaneously execute computer aided design (CAD) and computer aided manufacturing (CAM) in producing three-dimensional objects directly from computer instructions. […]
While the Claim section aims at legally protecting the invention (the construct in the mind of the inventor, with no physical substance), the Description discloses one or more embodiments (physical items). Drawings are standard in this section. The Description illustrates the invention to the public on the one hand and supports the Claim on the other. Notice how, while the language is still convoluted, it is less abstract.
- Abstract
Summarizes the invention description.
A system for generating three-dimensional objects by creating a cross-sectional pattern of the object to be formed at a selected surface of a fluid medium capable of altering its physical state in response to appropriate synergistic stimulation by impinging radiation, particle bombardment or chemical reaction, successive adjacent laminae […].
- Other metadata
Includes standard classification codes, prior art citations, relevant dates, and inventors’, assignees’, and examiners’ information.
- Patent classifications
Patents are classified using standard codes. The International Patent Classification (IPC) (wipo.int/classifications/ipc/en/, last accessed March 2021) and the Cooperative Patent Classification (CPC) (cooperativepatentclassification.org, last accessed March 2021) are the most widespread. Patent examiners assign codes manually depending on the invention’s technical characteristics. Patent US4575330A has 14 IPC classification codes. For example, code G09B25/02 indicates that the patent is in the Physics (G) section and goes on to specify the class (G09), sub-class (G09B), group (G09B25/00), and sub-group (G09B25/02).
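The hierarchy encoded in a classification symbol can be recovered mechanically. The following is a minimal sketch, assuming symbols are written as in the example above (e.g., G09B25/02); the `IPCCode` type, its field names, and the `parse_ipc` helper are illustrative names introduced here and do not belong to any official tool.

```python
import re
from typing import NamedTuple


class IPCCode(NamedTuple):
    """Levels of an IPC symbol (field names are illustrative)."""
    section: str     # e.g. "G"
    ipc_class: str   # e.g. "G09"
    subclass: str    # e.g. "G09B"
    main_group: str  # e.g. "G09B25/00"
    subgroup: str    # e.g. "G09B25/02"


# Section letter (A-H), two class digits, subclass letter,
# main-group number, a slash, and the subgroup number.
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(code: str) -> IPCCode:
    """Split a full IPC symbol into its hierarchical levels."""
    match = _IPC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {code!r}")
    section, cls, sub, group, subgroup = match.groups()
    subclass = f"{section}{cls}{sub}"
    return IPCCode(
        section=section,
        ipc_class=f"{section}{cls}",
        subclass=subclass,
        main_group=f"{subclass}{group}/00",
        subgroup=f"{subclass}{group}/{subgroup}",
    )


print(parse_ipc("G09B25/02"))
```

Truncating a symbol at any of these levels gives a coarser label, which is convenient when grouping patents by technical area.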
2.2 Patent language
In this section, we describe what makes patent documents unique from a linguistic perspective. Few documents are, in fact, as hard to process (for both humans and automatic systems) as patents, with their obscure language and complex discourse structure.
- Long sentences
According to patents’ rules, each claim must be written in a single sentence, which is therefore particularly long. Verberne et al. (2010) examined over 67 thousand Claim sections and found a median length of 22 words and a mean of 55; note that this figure is highly underestimated, as the authors segment sentences using semicolons in addition to full stops. In contrast, the British National Corpus median length is less than 10. For comparison, the first claim in patent US4575330A (a “rather short” one) is 69 words long, while claim 2 contains 152 words. Shinmori et al. (2003) found similar characteristics in Japanese. While most quantitative work focuses on the Claim, sentences in other sections are also remarkably long.
- Words’ distribution and vocabulary
Claims do not use much lexicon not covered in general English, but their word frequency is different, and novel technical multi-word terms are created ad hoc (Verberne et al., 2010). Moreover, many words are used unusually: said, for example, typically refers back to a previously mentioned entity, repeated to minimize ambiguity (e.g., A system for […], said system comprising […], in claim 1); transitions (e.g., comprising, including, wherein, consisting) have specific legal meanings. The Claim’s language is abstract (system, object, medium in claim 1), so as not to limit the invention’s scope, while the Description is more concrete (Codina-Filbà et al., 2017).
- Complex syntactic structure
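Returning to the sentence-length figures discussed above under Long sentences, the effect of the segmentation choice can be checked on the running example. The sketch below is only illustrative: the `segment_lengths` helper is a name introduced here, tokens are obtained by whitespace splitting, and full stops inside abbreviations are not handled.

```python
import re
from statistics import mean, median

# Claim 1 of patent US4575330A, as quoted in Section 2.1.
CLAIM_1 = (
    "A system for producing a three-dimensional object from a fluid medium "
    "capable of solidification when subjected to prescribed synergistic "
    "stimulation, said system comprising: means for drawing upon and forming "
    "successive cross-sectional laminae of said object at a two-dimensional "
    "interface; and means for moving said cross-sections as they are formed "
    "and building up said object in step wise fashion, whereby a "
    "three-dimensional object is extracted from a substantially "
    "two-dimensional surface."
)


def segment_lengths(text: str, split_on_semicolons: bool = True) -> list[int]:
    """Return the length in words of each segment of `text`.

    Segments end at full stops and, optionally, at semicolons.
    """
    pattern = r"[.;]" if split_on_semicolons else r"\."
    segments = [s.strip() for s in re.split(pattern, text)]
    return [len(s.split()) for s in segments if s]


whole = segment_lengths(CLAIM_1, split_on_semicolons=False)
parts = segment_lengths(CLAIM_1)
print(whole, parts, mean(parts), median(parts))
```

Splitting on semicolons breaks the single 69-word claim into shorter segments, which is why statistics computed this way underestimate the real sentence length.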
3 Task description
In this section, we will discuss the tasks of text summarization, simplification, and generation. We will define them from an NLP perspective and discuss their practical importance in the patent domain.
Loosely speaking, a summary is a piece of text that, based on one or more source documents, 1) contains the main information in such document(s) and 2) is shorter, denser, and less redundant. For a recent survey on text summarization, see (El-Kassas et al., 2021). Automatic summarization is an open problem in modern Natural Language Processing, and approaches vary widely. We will categorize previous work according to the following dimensions:
- Extractive vs. abstractive
Extractive summaries consist of sentences or chunks from the original document. To this end, most approaches divide the input into sentences and score their relevance. In contrast, abstractive approaches build an intermediate representation of the document first, from which they generate text that does not quote the input verbatim. Finally, hybrid systems take from both approaches; for example, they might select sentences extractively and then generate a paraphrased summary. Patent summaries have traditionally been extractive, but an interest in abstractive summarization is emerging.
- Generic vs. query-based
- Human- vs. machine-focused
While summaries are typically intended for humans, producing a shorter dense representation is equally relevant when the input is too long to be processed directly, e.g., by a machine learning algorithm. In this case, summarization constitutes a building block of a more complex pipeline. Tseng et al. (2007a, b), for example, perform summarization in view of patent-map creation and classification.
- Language-specific vs. multilingual
While published research has primarily been anglocentric, some works in other languages and multilingual techniques have been proposed.
As expected, patent summarization comes with its challenges. For example, while in some domains (e.g., news) the essential facts are typically in the first paragraphs, this assumption does not hold for patents, whose important content is spread across the whole input. Summaries also contain a high percentage of n-grams not in the source and shorter extractive fragments. Finally, summaries’ discourse structure is complex, and entities recur in multiple sentences. All these characteristics make patents an interesting testbed for summarization, for which a real semantic understanding of the input is crucial (Sharma et al., 2019).
In addition to the research interest, patent summaries are practically relevant for R&D teams, companies, and stakeholders. A brief search of online services showed that some companies sell patent summaries and related data as a paid service. For example, Derwent (https://clarivate.com/derwent, last accessed March 2021) produces patent abstracts distilling the novelty, use and advantages of the invention in plain English; to the best of our knowledge, the abstract is manually compiled by experts.
Automatic simplification reduces the linguistic complexity of a document to make it easier to understand. In contrast with summarization, all information is usually kept in the simplified text. Generally, approaches vary depending on the system’s target user (e.g., second-language learners, people with reading disorders, children). Sikka and Mago (2020) provide a recent survey on text simplification. Given patents’ complexity - lexically and syntactically - the challenge lies in making their content accessible to the lay reader (who justifiably gets scared away from patents) and simplifying the experts’ work.
We will consider the following aspects:
- Expert vs. lay target users
Patents’ audience ranges from specialists (e.g., attorneys and legal professionals) to laypeople (including academics) that might be interested, for example, in the invention’s technical features. Depending on the target user (and, in turn, on the target task), the degree of simplification might vary. When considering the legal nature of patents, for example, special attention should be given to keeping their scope unchanged. The first claim of patent US4575330A, for example, states: “A system for producing […] comprising: means for drawing […]; and means for moving […].” A system “comprising” a feature might include additional ones; thus, replacing the term with “consisting of” - which, in patent jargon, excludes any additional component - would be problematic, even if thesauruses treat the terms as synonyms (see, for example, Collins Online Thesaurus). Obviously, the attention to the jargon can be loosened if the target user is more interested in the technical characteristics than in the legal scope.
- Textual vs. graphical output
The simplification system’s output can be either a text or a more complex data structure. A textual output can be formatted appropriately (e.g., coloring essential words) (Okamoto et al., 2017), annotated with explanations (e.g., with links from a claim to a Description passage) (Shinmori and Okumura, 2004), or paraphrased (Bouayad-Agha et al., 2009). Alternatively, a graphical representation, in the form of trees or graphs - e.g., which highlights the relation among the invention components - can be used.
The simplification system can be designed with a specific application in mind: in (Okamoto et al., 2017), for example, authors designed an interface to help patent experts in comparing documents from the same patent family.
As in the case of summaries, designing appropriate simplification systems has interesting use cases. Suominen et al. (2018) performed a user study with both experts and laypeople: most of their participants considered patents difficult to read. When presented with various reading aids, most considered them useful. Even law scholars have called for the use of a simpler language in patents (Feldman, 2008). Commercially, companies that provide patent reports do so in plain language. Somewhat ironically, Derwent goes as far as replacing the document title with a less obscure one, of more practical use.
We will use Patent Generation to refer to methods that aim at generating a patent or part of it. To the best of our knowledge, this line of research is relatively new and is likely inspired by the recent success of modern generative models (e.g. GPT and its evolutions (Radford et al., 2018, 2019; Brown et al., 2020)) in various domains, including law (Huang et al., 2020), health (Amin-Nejad et al., 2020) and journalism (Shu et al., 2020), to name a few.
Some approaches only produce “patent-like” text (i.e., employing technical terminology and respecting patents’ writing rules): their generation is unconstrained or constrained to a short user prompt - the first words of a text that the system needs to extend coherently. Their practical use is likely limited, but their success shows that even patents’ obscure language can be mastered by machines, at least at a superficial level. Another class of approaches conditions the generation on a fragment of the patent to produce a coherent output. For example, one might want to produce a plausible patent Abstract given its Title or a set of coherent claims with a given Description. In this case, the generation is constrained to the whole input section (e.g., the Title text) and the type of output section (e.g., Abstract).
While patent generation is still in its early days, researchers dream of “augmented inventing” (Lee and Hsiang, 2020a), assisting inventors in redefining their ideas and helping with patent drafting. To this end, some hybrid commercial solutions already exist on the market (see, for example, https://bohemian.ai/case-studies/automated-patent-drafting/, https://www.patentclaimmaster.com/automation.html, and https://harrityllp.com/services/patent-automation/; last accessed March 2021).
Patent documents are issued periodically by the responsible patent offices. The United States Patent and Trademark Office (USPTO), for example, publishes patent applications and grants weekly, along with other bibliographic and legal data (developer.uspto.gov/data, last accessed March 2021). To access the documents programmatically, Application Programming Interfaces (APIs) are available. PatentsView (www.patentsview.org/, last accessed March 2021), for example, is a visualization and mining platform to search and download USPTO patents, updated every three months. It provides several endpoints (patent, inventor, assignees, location, CPC, etc.) and a custom query language. Google also provides public datasets (console.cloud.google.com/marketplace/browse?q=google%20patents%20public%20datasets&filter=solution-type:dataset, last accessed March 2021), accessible through BigQuery.
While massive datasets exist, few are annotated. Annotated data are of the greatest importance for supervised learning-based methods and provide a gold-standard for evaluation; moreover, having a set of shared benchmarks allows approaches to be compared directly, which is much more difficult otherwise. The only large-scale dataset for patent summarization is BigPatent (evasharma.github.io/bigpatent, last accessed March 2021) (Sharma et al., 2019). The dataset was recently built for abstractive summarization and contains 1.3 million patents’ Descriptions and their Abstracts (a.k.a. Summary of Description) as human-written references. While most previous work focuses on Claims’ summarization, no comparable Claim to summary dataset exists (nor would it be easy to obtain), and authors resort to expert-written summaries for evaluation.
For patent simplification, no simplified corpus exists to date.
The evaluation of a generated text, be it a summary, a simplification, or a completely new document, is currently an open problem in Natural Language Generation (Celikyilmaz et al., 2020). Qualitative approaches resort to humans to evaluate high-level characteristics (e.g., coherence, grammaticality, readability). In contrast, automatic approaches often measure the similarity between the output and human-written gold-standards.
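For patent summarization, qualitative evaluation involves experts and non-experts; Mille and Wanner (2008), for example, assess summaries’ intelligibility, simplicity, and accuracy on a Likert scale (Robinson, 2014). Quantitatively, the most widespread automatic summarization metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004). It measures the overlap between the generated sentence and the gold-standard. ROUGE-N is n-gram based and is measured as:

$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}$$

where $\text{Count}_{\text{match}}(\text{gram}_n)$ is the number of n-grams co-occurring in the generated text and in the reference, and $\text{Count}(\text{gram}_n)$ is the number of n-grams in the reference.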
Thus, a ROUGE-1 of 0.5 means half of the words in the reference are in the summary generated by the model. ROUGE-L measures the similarity in terms of the Longest Common Subsequence (LCS). Words of the LCS must appear in the same relative order but not necessarily be contiguous. ROUGE-1, ROUGE-2 (for relevance), and ROUGE-L (for fluency) are generally used in practice, as they best correlate with human judgment. Similarly, some studies measure the similarity between the generated text and the reference summary in terms of unigram Precision, Recall, and F-measure. The Compression Ratio and the Retention Ratio (the percentage of original information kept in the summary) are also frequently reported. Finally, when summarization is part of a more complex pipeline, the relative improvement of the downstream task is considered.
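To make the recall-oriented definitions above concrete, the following is a minimal sketch of ROUGE-N and of a recall-only variant of ROUGE-L. It assumes a single reference, lowercased whitespace tokenization, and no stemming; the function names are illustrative and the code is not a replacement for the official ROUGE toolkit.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also occur in the candidate."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by the number of times the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference: str, candidate: str) -> float:
    """LCS length divided by the reference length."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0


reference = "a laser solidifies thin layers of liquid resin"
candidate = "a laser hardens liquid resin"
print(rouge_n(reference, candidate, 1))      # 4 of 8 reference unigrams matched
print(rouge_n(reference, candidate, 2))      # 2 of 7 reference bigrams matched
print(rouge_l_recall(reference, candidate))  # LCS: "a laser liquid resin"
```

In this toy pair, four of the eight reference words appear in the candidate, which is exactly the ROUGE-1 of 0.5 case described above.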
When evaluating simplification approaches, two different points of view exist. The first only considers the method’s correctness: if the algorithm needs to segment the text, one can manually annotate a segmented gold-standard and measure accuracy. However, assessing the readability improvement requires qualitative studies. Suominen et al. (2018), for example, use a questionnaire for quantifying patents’ complexity and test simplification solutions. Following their work’s findings, experts’ and laypeople’s opinions should be analyzed separately, as they are concerned with different issues. For instance, experts worry that the simplified patent might be misrepresented and its legal scope changed, while laypeople demand strategies to understand the invention and find information.
Finally, measuring the quality of generated patent text is generally tricky. When no gold-standard exists, some authors have introduced ad hoc measures (see, for example, (Lee and Hsiang, 2020b)); when a human-written reference exists, metrics such as ROUGE can be used. Note, however, that some studies criticize the use of ROUGE; Lee (2020), for example, also reports the results using the Universal Sentence Encoder (Cer et al., 2018) representation, which they speculate handles semantics better.
6 Approaches for patent summarization
In this section, we describe extractive and abstractive approaches to patent summarization. As we discussed already, their direct comparison is difficult, as publications tend to use slightly different tasks on unshared data.
6.1 Extractive summarization
Extractive approaches select the most informative sentences in the original document. A typical pipeline comprises the following steps:
Document segmentation: documents are split into segments, sentences, or paragraphs, using punctuation or heuristics.
Feature extraction: for each sentence, features include keywords, title words, cue words (from expert-designed lists), and the sentence position. Keywords might be single- or multi-word terms. Trappey et al. (2006) use TF-IDF to extract keywords automatically and in (Trappey et al., 2008, 2009) add an ontology for domain-specific keyphrases. Tseng et al. (2007a) propose an algorithm that merges nearby unigrams and extracts maximally repeated strings as multi-word terms. The position might also be considered (favoring sentences at the beginning). Query-oriented approaches also measure the sentence similarity to the query (e.g., with overlapping words (Girthana and Swamynathan, 2020)), which can be further expanded using a domain ontology (Girthana and Swamynathan, 2019a) or general-domain resources (Girthana and Swamynathan, 2019b) like WordNet. Table 1 includes some frequent features in extractive patent summarization.
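Sentence weighting: the extracted features are used to score the sentence relevance in the summary. For example, Tseng et al. score sentences as:

$$\text{score}(S) = \Big( \sum_{w \in \text{keywords} \cup \text{title words}} \text{TF}_w + \sum_{w \in \text{cue words}} \text{mean}(\text{TF}) \Big) \times FS \times P$$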
where TF is the term frequency of word w in sentence S, mean(TF) is the average term frequency over keywords and title words in S, and FS and P are the position weights, assigned heuristically. Another option is to learn weights from data directly: for example, Codina-Filbà et al. (2017) score each segment as a weighted sum of its features, $\text{score}(s) = \sum_i w_i f_i(s)$; they train a linear regression to learn the feature weights $w_i$ based on textual segments and their cosine similarity to the gold-standard. Lastly, sentences can be classified as relevant or not relevant: to this end, Girthana and Swamynathan (2019a, 2020) train a Restricted Boltzmann Machine (Larochelle and Bengio, 2008) without supervision. To minimize repetitions, (Trappey et al., 2006; Trappey and Trappey, 2008; Trappey et al., 2008) cluster semantically similar sentences and only select one sentence per cluster.
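The three steps above (segmentation, feature extraction, sentence weighting) can be sketched in a few lines. This is a simplified illustration rather than a reimplementation of any of the cited systems: the stopword list, the cue-word list, the number of keywords, and the first-sentence weight are all hypothetical choices made for the example.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "are", "for",
             "by", "with", "that", "this", "from", "at", "as", "it", "on"}
CUE_WORDS = {"invention", "provides", "improved", "object"}  # hypothetical list


def split_sentences(text: str) -> list[str]:
    """Document segmentation: split after full stops and semicolons."""
    return [s.strip() for s in re.split(r"(?<=[.;])\s+", text) if s.strip()]


def tokenize(sentence: str) -> list[str]:
    return re.findall(r"[a-z][a-z-]*", sentence.lower())


def score_sentences(text: str, title: str, n_keywords: int = 5,
                    first_sentence_weight: float = 1.5):
    """Return (score, position, sentence) triples for every sentence."""
    sentences = split_sentences(text)
    # Feature extraction: frequent content words act as keywords.
    doc_tf = Counter(w for s in sentences for w in tokenize(s)
                     if w not in STOPWORDS)
    keywords = {w for w, _ in doc_tf.most_common(n_keywords)}
    title_words = {w for w in tokenize(title) if w not in STOPWORDS}
    important = keywords | title_words
    scored = []
    for position, sentence in enumerate(sentences):
        tf = Counter(tokenize(sentence))
        hits = [tf[w] for w in important if tf[w] > 0]
        mean_tf = sum(hits) / len(hits) if hits else 0.0
        n_cues = sum(1 for w in CUE_WORDS if tf[w] > 0)
        # Sentence weighting: keyword/title frequencies plus a cue-word bonus,
        # multiplied by a heuristic position weight.
        weight = first_sentence_weight if position == 0 else 1.0
        scored.append(((sum(hits) + n_cues * mean_tf) * weight,
                       position, sentence))
    return scored


def summarize(text: str, title: str, k: int = 2) -> list[str]:
    """Select the k best-scoring sentences, kept in document order."""
    top = sorted(score_sentences(text, title),
                 key=lambda item: item[0], reverse=True)[:k]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]


text = ("The present invention provides a system for generating a "
        "three-dimensional object. The weather was pleasant. Successive "
        "laminae of the object are formed at the surface of a fluid medium.")
print(summarize(text, "Apparatus for production of three-dimensional objects"))
```

Learning-based variants replace the hand-set weights with coefficients fitted on segments paired with a gold-standard, and a clustering step can be added before selection to reduce redundancy.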
While popular, the above pipeline is not the only route to extractive summarization. Alternatively, Bouayad-Agha et al. (2009) exploit the discourse structure, which they prune following predefined rules. Finally, Souza et al. (2019) discuss applying general-domain algorithms to patent sub-groups naming (patent sub-groups are the most specific level of the patents’ classification hierarchy and are named with a representative name, e.g., “Extracting optical codes from image or text carrying said optical code”): in that context, LSA (Dokun and Celebi, 2015) performs best compared to LexRank (Erkan and Radev, 2004) and to a TF-IDF approach.
Table 1: Frequent features in extractive patent summarization.
|Feature||Description|
|Term frequency - Inverse Document Frequency||Measures a keyword’s importance|
|Coreference-chain based||Entities coreferenced repeatedly are more central|
|Query similarity||Relevance to the query|
|Length||Overly long segments might be discouraged|
|Number of keywords|
|Number of cue-words|
6.2 Abstractive models
Abstractive models exploit a semantic representation of the input. In the patent domain, the first approaches used deep syntactic structures. Mille and Wanner (2008) first simplify the claims (see (Bouayad-Agha et al., 2009)) to make parsing easier and then map the shallow syntactic structures to deep ones, using rules. Deep syntactic structures are closer to a semantic representation and thus used for summarization: to this end, the least relevant chunks are removed using handcrafted rules. Finally, they transfer the summarized deep structures to the target language (English, French, Spanish, or German) and use a generator to convert them to text.
More recently, neural models have revolutionized Natural Language Processing. These models act on the text directly and use neural networks to extract a representation optimized for the task to be solved. For abstractive summarization, a sequence-to-sequence model typically extracts a hidden representation from the input text (encoding) and then uses it to generate the output (decoding). While neural performance is indisputable, models require many input-output samples to learn from: that is probably why they have only spread very recently in the patent domain. No large-scale summarization dataset, in fact, existed before 2019, when BigPatent (Sharma et al., 2019) was published. Sharma et al. proposed several baselines: an LSTM (Sutskever et al., 2014) with attention (Bahdanau et al., 2015), a Pointer-Generator (See et al., 2017) with and without coverage, and SentRewriting (Chen and Bansal, 2018b) (a hybrid approach).
Given its differences with the previously available datasets (mostly in the news domain), BigPatent became an interesting testbed even for general domain NLP summarization models: this is the case of Pegasus (Zhang et al., 2020), a pre-trained transformer (Vaswani et al., 2017) for summarization. During pre-training, whole sentences from the input are masked, and the model needs to generate them from the rest of the input (Gap Sentence Generation).
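The Gap Sentence Generation objective can be illustrated by building one training pair from a short text. The sketch below is an assumption-laden toy: the `<mask_1>` placeholder string, the sentence splitter, and the choice of masking the sentence with the highest unigram overlap with the rest of the document are all choices made for this example, and details of the actual pre-training setup may differ.

```python
import re
from collections import Counter

MASK = "<mask_1>"  # placeholder string used only in this sketch


def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(sentence: str, others: list[str]) -> float:
    """Unigram F1 between a sentence and the rest of the document."""
    sent = Counter(sentence.lower().split())
    rest = Counter(w for s in others for w in s.lower().split())
    common = sum((sent & rest).values())
    if common == 0:
        return 0.0
    precision = common / sum(sent.values())
    recall = common / sum(rest.values())
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(text: str, n_masked: int = 1) -> tuple[str, str]:
    """Return (masked input, target) for one document."""
    sentences = split_sentences(text)
    scores = [overlap_score(s, sentences[:i] + sentences[i + 1:])
              for i, s in enumerate(sentences)]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    masked = set(ranked[:n_masked])
    # The model sees the document with gaps and must generate the missing text.
    source = " ".join(MASK if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target


source, target = gap_sentence_example(
    "A laser cures the liquid resin. The platform then moves down by one "
    "layer. The laser cures the next layer of liquid resin."
)
print(source)
print(target)
```

Because the target is a whole sentence that must be inferred from the remaining context, the objective resembles summarization more closely than token-level masking does.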
One of the significant challenges of the dataset is the input length, which is very large (with a 90th percentile of 7693 tokens), and is problematic for standard transformers (whose attention mechanism scales quadratically in the input size): to this end, BIGBIRD (Zaheer et al., 2020) proposes a sparse attention mechanism which, to the best of our knowledge, is to date state of the art on the dataset.
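A quick count shows why quadratic scaling matters at this input length. The sketch below compares the number of query-key pairs in full attention with those of a plain sliding window; the window pattern and its size are assumptions chosen for illustration and stand in for, rather than reproduce, the sparse pattern used by BIGBIRD.

```python
def full_attention_pairs(n: int) -> int:
    """Every token attends to every token: n * n pairs."""
    return n * n


def windowed_attention_pairs(n: int, window: int) -> int:
    """Each token attends to itself and up to `window` neighbours per side."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1
               for i in range(n))


n_tokens = 7693   # 90th percentile input length reported for BigPatent
window = 64       # hypothetical window size
full = full_attention_pairs(n_tokens)
sparse = windowed_attention_pairs(n_tokens, window)
print(full, sparse, round(full / sparse, 1))
```

The windowed count grows linearly with the input length, so the gap with full attention widens as documents get longer.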
Table 2: Summarization models’ performance on the BigPatent dataset.
|Model||ROUGE-1||ROUGE-2||ROUGE-L|
|TextRank (Mihalcea and Tarau, 2004)||35.99||11.14||29.60|
|LexRank (Erkan and Radev, 2004)||35.57||10.47||29.03|
|SumBasic (Nenkova and Vanderwende, 2005)||27.44||7.08||23.66|
|RNN-ext RL (Chen and Bansal, 2018a)||34.63||10.62||29.43|
|LSTM seq2seq (Sutskever et al., 2014) + attention||28.74||7.87||24.66|
|Pointer-Generator (See et al., 2017)||30.59||10.01||25.65|
|Pointer-Generator + coverage (See et al., 2017)||33.14||11.63||28.55|
|SentRewriting (Chen and Bansal, 2018b)||37.12||11.87||32.45|
|TLM (Pilault et al., 2020)||36.41||11.38||30.88|
|TLM + Extracted sentences||38.65||12.31||34.09|
|CTRL (He et al., 2020)||45.80||18.68||39.06|
|Pegasus (Zhang et al., 2020) (no pretraining)||42.98||20.51||31.87|
|BIGBIRD-RoBERTa (base, MLM) (Zaheer et al., 2020)||55.69||37.27||45.56|
|BIGBIRD-Pegasus (large, Pegasus pretrain)||60.64||42.46||50.01|
Summarization models’ performance on the BigPatent dataset is shown in Table 2. Note how the pre-trained transformer models obtain the best results, in line with the general trend in Natural Language Processing.
Finally, summarization methods could also be used for solving specific patent tasks. CTRLsum (He et al., 2020), for example, is a system that allows controlling the generated text by interacting through keywords or short prompts. The authors experiment with inputting [the purpose of the present invention is] to retrieve and extract the patent aim. Moreover, de Souza et al. (2021) have compared extractive and abstractive models in naming patents’ subgroups. When used to “summarize” the Abstract to produce a patent Title - which should contain, similarly to its subgroup name, the essence of the invention - extractive methods were found superior. This result highlights the challenges met by abstractive models, which are likely to be magnified in the legal domain.
6.3 Hybrid models
Hybrid models integrate elements of extractive and abstractive summarization. For example, the TOPAS workbench (Brügmann et al., 2015) includes a module that first selects segments extractively and then paraphrases them. A similar approach was adopted in (Codina-Filbà et al., 2017). More recently, Pilault et al. (2020) have shown that adding previously extracted sentences to the input when training a language model helps with long dependencies and improves the model’s abstractiveness. While the models described so far train the extractive and the abstractive components separately, SentRewriting (Chen and Bansal, 2018b) uses reinforcement learning for selecting salient sentences and trains the model end to end.
In contrast with the previous works, Trappey et al. (2020) explore an abstractive to extractive approach. They use an LSTM with attention to guide the extraction of relevant sentences: it receives a set of English and Chinese documents (Title, Abstract, and Claim) and is trained to produce a human-written summary (abstractive component). After the training, the words with the highest attention weights are retrieved and treated as automatically-extracted keywords; sentences are then scored and extracted accordingly (extractive component).
7 Approaches for Patent simplification
Patents’ claims are the hardest section of an overall hard-to-read document. As such, a lot of effort has been spent on improving the accessibility and readability of the Claim. Given the Claim’s legal nature, however, the extent of the modification is crucial, and previous approaches’ views of the task have varied widely.
Ferraro et al. (2014), for example, aim at improving the Claim’s presentation without modifying its text. They segment each claim into preamble, transition, and body (rule-based) and then further divide the body into clauses using a Conditional Random Field. Knowing the elements’ boundaries, the claim can then be formatted more clearly, e.g., adding line breaks.
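As a rough illustration of the rule-based step, the sketch below splits a claim at the first transitional phrase and prints the body elements on separate lines. The phrase list and the helper names are assumptions made for this example, and splitting the body at semicolons is only a simple stand-in for the Conditional Random Field that Ferraro et al. (2014) use for clause segmentation.

```python
import re

# Assumed, non-exhaustive list of transitional phrases (illustrative only).
TRANSITIONS = [
    "consisting essentially of",
    "consisting of",
    "comprising",
    "including",
    "characterized in that",
    "wherein",
]
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in TRANSITIONS) + r")\b:?",
    re.IGNORECASE,
)


def segment_claim(claim):
    """Split a claim into preamble, transition, and body at the first transition."""
    match = _PATTERN.search(claim)
    if match is None:
        return {"preamble": claim.strip(), "transition": "", "body": ""}
    return {
        "preamble": claim[: match.start()].strip(" ,"),
        "transition": match.group(1).lower(),
        "body": claim[match.end():].strip(),
    }


def split_body(body):
    """Naive clause splitter: one element per semicolon-separated chunk."""
    return [element.strip() for element in body.split(";") if element.strip()]


def format_claim(claim):
    """Re-format a claim with one body element per indented line."""
    parts = segment_claim(claim)
    if not parts["transition"]:
        return parts["preamble"]
    header = f'{parts["preamble"]}, {parts["transition"]}:'
    return "\n".join([header] + ["    " + e for e in split_body(parts["body"])])


claim = ("1. A bicycle frame, comprising: a top tube; a down tube; "
         "and a seat tube connected to the top tube.")
print(format_claim(claim))
```

The text of the claim is not altered apart from whitespace and separators, which mirrors the presentation-only goal described above.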
A somewhat opposite approach was taken in the PATExpert project (Wanner et al., 2008), which developed a rewriting and paraphrasing module (Bouayad-Agha et al., 2009). The researchers considered two levels of simplification. The first uses surface criteria to segment the input and reconstructs the chunks into shorter, easier-to-read sentences (Bouayad-Agha et al., 2009). The second (Mille and Wanner, 2008) is conceptually similar to the same authors’ approach to multilingual summarization: after shallow simplification and segmentation, patents are parsed and projected to Deep Syntactic Structures. This representation is in turn used to rewrite a text that is simpler for the reader to process (possibly in another language). Both approaches modify the patent text. Note how, in this framework, rewriting and summarization are essentially unified, with the key difference that no content is removed for simplification.
Instead of relying on linguistic techniques, Okamoto et al. (2017) use an Information Extraction engine that detects entity types and their relations using distant supervision. They provide a visualization interface which a) formats each patent claim to improve readability: color is used to highlight the claim type (e.g., apparatus, method), the transition, and technical components in the patent body; b) shows the Claim structure: for each claim, they include its type, dependencies, and references to other technologies and components. They target patent experts, who might use the system to compare claims (e.g., in the same patent family) and search for similar documents.
The approaches described so far output a simplified and easier-to-read textual version of the original Claim. Another option is to visualize the claims in a structured way. Andersson et al. (2013), for example, obtain a connected graph of the claim content; each node contains a noun phrase (NP) and is linked to other nodes through a verb, a preposition, or a discourse relation. Similarly, Kang et al. (2018) construct a graph for visualizing the patent content in the context of an Information Retrieval pipeline. Sheremetyeva (2014) uses visualization on two levels: the system first constructs a hierarchical tree of the whole Claim section (highlighting dependency relations) and then simplifies each claim. In this second phase, a tailored linguistic analysis is used (Sheremetyeva, 2003); the simplified claim is segmented into shorter phrases (whose NPs are highlighted and linked to the Description) and visualized as a forest of trees.
Note that most approaches do not measure the improvement in readability, so it is unclear how effective they are in enhancing intelligibility.
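A hierarchical tree of the Claim section can be approximated with a short Python sketch that links each dependent claim to the claim it refers to. The regular expression, the function names, and the choice to keep only the first referenced claim are simplifying assumptions, not the procedure of Sheremetyeva (2014).

```python
import re
from collections import defaultdict

# Assumed pattern for references such as "of claim 1" or "according to claims 2".
_REF = re.compile(r"\bclaims?\s+(\d+)", re.IGNORECASE)


def claim_parents(claims):
    """Map each claim number to the earlier claim it cites (None if independent)."""
    parents = {}
    for number, text in claims.items():
        refs = [int(n) for n in _REF.findall(text) if int(n) < number]
        # Simplification: multiple dependencies are reduced to the first one.
        parents[number] = refs[0] if refs else None
    return parents


def render_tree(parents):
    """Render the dependency relations as an indented text tree."""
    children = defaultdict(list)
    for number, parent in sorted(parents.items()):
        children[parent].append(number)
    lines = []

    def visit(number, depth):
        lines.append("  " * depth + f"claim {number}")
        for child in children[number]:
            visit(child, depth + 1)

    for root in children[None]:
        visit(root, 0)
    return "\n".join(lines)


# Invented toy claims for illustration.
claims = {
    1: "A pump comprising a housing and an impeller.",
    2: "The pump of claim 1, wherein the housing is steel.",
    3: "The pump of claim 2, wherein the steel is stainless.",
    4: "A method of operating a pump.",
}
print(render_tree(claim_parents(claims)))
```

Independent claims appear at the top level, and each dependent claim is indented under the claim it refines.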
Finally, the Claim simplification problem was also studied for the Japanese language. In particular, Shinmori et al. propose a method to expose patent structure using manually-defined cue phrases (Shinmori et al., 2002) and explain invention-specific terms using the Description (Shinmori et al., 2003). In (Shinmori and Okumura, 2004), Description chunks are used to paraphrase corresponding sentences in the Claim and improve readability.
8 Approaches for Patent generation
The task of Patent generation has recently been investigated by Lee and Hsiang, who try to leverage state-of-the-art NLP models to generate patent text. Their early work (Lee and Hsiang, 2020a) fine-tunes GPT-2 - a language model which demonstrated impressive results in generating text from a wide range of domains - using patents’ first claim. Interestingly, only a small number of fine-tuning steps is sufficient to adapt the general-domain model and produce patent-like text. However, the quality of the generation is not measured. This gap is partially filled in (Lee and Hsiang, 2020b), where a BERT classifier is used to measure whether two consecutive, automatically generated spans are consistent. They train the classifier on consecutive spans from the same patent (positive examples) and on spans from non-overlapping classes and subclasses (negative examples), which might make the classification not particularly difficult (e.g., the model could rely on shallow lexical features). The generation process is further investigated in (Lee and Hsiang, 2020c), which, given a generated text, tries to find the most similar example in the generator’s fine-tuning data.
The models described above try to generate consistent text resembling a patent but whose topic and characteristics are not constrained. Lee (2020) takes a different route and trains the model to generate a patent’s sections (Title, Abstract, and claims) given other parts of the same patent. The model uses GPT-2, which receives as input the text on which to condition and learns to produce a section of the same patent accordingly. For example, one can input the Title of a patent and train the model to generate the corresponding Abstract. Two things should be noted: first, the author frames the problem as self-supervised and uses patents’ sections as gold standard, which simplifies evaluation; second, the problem generalizes abstractive patent summarization, so that it might be interesting to study the performance obtained when, e.g., generating the Abstract from the Description.
9 Current and future directions
This survey aimed at showing that patents are an interesting domain both for their practical importance and their linguistic challenges. While generative approaches for patents are still a relatively niche topic, with few active groups, the domain is drawing attention from general NLP practitioners for its unique characteristics. In the following, we present some open issues which might be worthy of future research.
- Data, data, data
Labeled and annotated data are scarce in the patent domain. For summarization, the only available large-scale dataset is BigPatent (Sharma et al., 2019), while no simplified corpus (let alone a parallel corpus) exists, to the best of our knowledge. Moreover, while BigPatent represented a milestone for patent summarization, the target Abstract is written in the typical arcane patent language; thus, systems trained on these data are probably of little practical use to laypeople, who would prefer a “plain English” abstract like those provided by commercial companies. A dataset that targets a clearer summary (unifying summarization and simplification) would also help in understanding models’ capabilities in going beyond shallow features and reaching a global understanding of the source. Finally, while no public corpora of simplified patent text exist to date, other domains have exploited creative ploys for minimizing human effort: in the medical domain, for example, Pattisapu et al. (2020) use social media content to create a simplified corpus.
There are many approaches to summarization and simplification. However, the absence of shared benchmarks makes them difficult to compare. For extractive summarization, for example, many studies have only compared their results with a baseline or a general-domain commercial system. Moreover, the approaches solve slightly different tasks on different datasets and often fail to report implementation details, which further hinders a direct comparison of their performance.
- Evaluation metrics
Generative approaches for patents often resort to general-domain metrics for evaluation (e.g., ROUGE). However, it is not clear how suitable these measures are for the patent domain, given its peculiarities. In the context of abstractive summarization and patent generation, some works (de Souza et al., 2021; Lee, 2020) highlight that ROUGE is unable to recognize semantically similar sentences expressed in different wording. In the context of Natural Language Generation, some new measures have recently been proposed to solve these issues. BERTScore (Zhang et al., 2020), for example, evaluates the similarity between the summary and gold-standard tokens instead of their exact match, while QAGS (Wang et al., 2020) uses a set of questions to evaluate factual consistency between a summary and its source (a reference is not needed). It is yet to be explored whether these metrics could be applied to the patent domain successfully. Finally, note that even human studies are difficult in the patent domain, as they require a high level of expertise, which most people lack.
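A toy example shows why surface overlap penalizes paraphrases. The function below is a simplified unigram-overlap F1 in the spirit of ROUGE-1 (whitespace tokenization, no stemming); it is not the official ROUGE package, and the sentences are invented for illustration.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in the other text.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the device reduces energy consumption"
paraphrase = "this apparatus lowers power usage"  # same meaning, no shared words
near_copy = "the device reduces noise"            # different meaning, shared words

print(rouge1_f1(paraphrase, reference))
print(rouge1_f1(near_copy, reference))
```

Under this simplified metric the paraphrase scores zero because it shares no word with the reference, while the near copy scores well above it despite stating a different fact.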
- Factuality
While neural abstractive models have shown impressive performance in summarization, they tend to fabricate information. Cao et al. (2018) studied the phenomenon in the news domain and found that around 30% of the generated summaries included fake facts. This behavior is particularly problematic in a legal context; ROUGE, however, is a surface metric and is unable to detect factual inconsistencies.
- Domain adaptation
Patents’ language hardly resembles general-discourse English (used in pre-training), but the domain adaptation problem has not been studied in detail. Among previous works, Aghajanyan et al. (2021) propose a second multitask pre-training step, Chen et al. (2020) study models’ cross-domain performance, and Fabbri et al. (2020) evaluate zero- and few-shot settings; all these works describe applications to the patent domain, among others.
- Input length
Patent documents are extremely long. For summarization, the only datasets with comparable or longer inputs are the arXiv and PubMed datasets (Cohan et al., 2018), which summarize entire research papers. While solutions to allow the processing of long inputs have been proposed, the in-depth study of methods and performance for such long documents is still in its early days. For neural models, a very long input translates into prohibitive computational requirements (e.g., several GPUs), which researchers have recently tried to mitigate by modifying the underlying architectures.
References
- Muppet: massive multi-task representations with pre-finetuning. CoRR abs/2101.11038. Cited by: item Domain adaptation.
- Exploring transformer text generation for medical dataset augmentation. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 4699–4708. Cited by: §3.3.
- Domain adaptation of general natural language processing tools for a patent claim visualization system. In Multidisciplinary Information Retrieval, M. Lupu, E. Kanoulas, and F. Loizides (Eds.), Berlin, Heidelberg, pp. 70–82. Cited by: Figure 3, §7.
- “80% of technical information found only in patents”–is there proof of this? World Patent Information 48, pp. 16–28. Cited by: §1.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.). Cited by: §6.2.
- Simplification of patent claim sentences for their paraphrasing and summarization. In Proceedings of the 22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS), pp. 302–303. Cited by: §6.2, §7.
- Improving the comprehension of legal documentation: the case of patent claims. pp. 78–87. Cited by: item Textual vs. graphical output, §6.1, §7.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). Cited by: §3.3.
- Towards content-oriented patent document processing: intelligent patent analysis and summarization. World Patent Information 40, pp. 30–42. Cited by: §6.3.
- The challenge of syntactic dependency parsing adaptation for the patent domain. Cited by: item Complex syntactic structure.
- Faithful to the original: fact-aware neural abstractive summarization. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 4784–4791. Cited by: item Factuality.
- Evaluation of text generation: a survey. Cited by: §5.
- Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 169–174. Cited by: §5.
- Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 675–686. Cited by: §6.2, §6.3, Table 2.
- CDEvalSumm: an empirical study of cross-dataset evaluation for neural summarization systems. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 3679–3691. Cited by: item Domain adaptation.
- Using genre-specific features for patent summaries. Information Processing & Management 53 (1), pp. 151–174. Cited by: item Words’ distribution and vocabulary, item 4, §6.3.
- A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 615–621. Cited by: item Input length.
- A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset. Scientometrics 126, pp. 135–156. Cited by: §6.2, item Evaluation metrics.
- Single-document summarization using latent semantic analysis. International Journal of Scientific Research in Information Systems and Engineering (IJSRISE) 1 (2), pp. 57–64. Cited by: §6.1.
- Automatic text summarization: a comprehensive survey. Expert Systems with Applications 165, pp. 113679. Cited by: §3.1.
- LexRank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res. 22 (1), pp. 457–479. Cited by: §6.1, Table 2.
- Improving zero and few-shot abstractive summarization with intermediate fine-tuning and data augmentation. arXiv preprint arXiv:2010.12836. Cited by: item Domain adaptation.
- Plain language patents. Vol. 17, pp. 289. Cited by: §3.2.
- Segmentation of patent claims for improving their readability. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), Gothenburg, Sweden, pp. 66–73. Cited by: Figure 1, §7.
- Query Oriented Extractive-Abstractive Summarization System (QEASS). In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, CoDS-COMAD ’19, New York, NY, USA, pp. 301–305. Cited by: item Generic vs. query-based, item 3, item 4.
- Semantic Query-Based Patent Summarization System (SQPSS). In Advances in Data Science, L. Akoglu, E. Ferrara, M. Deivamani, R. Baeza-Yates, and P. Yogesh (Eds.), Singapore, pp. 169–179. Cited by: item Generic vs. query-based, item 3.
- Query-Oriented Patent Document Summarization System (QPSS). In Soft Computing: Theories and Applications, M. Pant, T. K. Sharma, O. P. Verma, R. Singla, and A. Sikander (Eds.), Singapore, pp. 237–246. Cited by: item Generic vs. query-based, item 3, item 4.
- A survey of automated hierarchical classification of patents. In Professional search in the modern world, pp. 215–249. Cited by: §1.
- CTRLsum: towards generic controllable text summarization. arXiv preprint arXiv:2012.04281. Cited by: §6.2, Table 2.
- Generating reasonable legal text through the combination of language modeling and question answering. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere (Ed.), pp. 3687–3693. Note: Main track. Cited by: §3.3.
- Text simplification of patent documents. In Automated Invention for Smart Industries, D. Cavallucci, R. De Guio, and S. Koziołek (Eds.), Cham, pp. 225–237. Cited by: §7.
- Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 536–543. Cited by: item 4.
- Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information 62, pp. 101983. Cited by: §3.3, §8.
- PatentTransformer-1.5: measuring patent claim generation by span relevancy. In New Frontiers in Artificial Intelligence, M. Sakamoto, N. Okazaki, K. Mineshima, and K. Satoh (Eds.), Cham, pp. 20–33. Cited by: §5, §8.
- Prior art search and reranking for generated patent text. Cited by: §8.
- Controlling patent text generation by structural metadata. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3241–3244. Cited by: §5, §8, item Evaluation metrics.
- ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. Cited by: §5.
- Patent retrieval. Found. Trends Inf. Retr. 7 (1), pp. 1–97. Cited by: §1.
- TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 404–411. Cited by: Table 2.
- Making text resources accessible to the reader: the case of patent claims. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. Cited by: §7.
- Multilingual summarization in practice: the case of patent claims. In Proceedings of the 12th Annual conference of the European Association for Machine Translation, Hamburg, Germany, pp. 120–129. Cited by: item Complex syntactic structure, §5, §6.2, §7.
- The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005 101. Cited by: Table 2.
- Applying information extraction for patent structure analysis. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pp. 989–992. Cited by: item Textual vs. graphical output, item Application, Figure 2, §7.
- Leveraging social media for medical text simplification. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, New York, NY, USA, pp. 851–860. Cited by: item Data, data, data.
- On extractive and abstractive neural document summarization with transformer language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9308–9319. Cited by: §6.3, Table 2.
- Language models are unsupervised multitask learners. Cited by: §3.3.
- Improving language understanding by generative pre-training. Cited by: §3.3.
- Likert scale. In Encyclopedia of Quality of Life and Well-Being Research, A. C. Michalos (Ed.), pp. 3620–3621. Cited by: §5.
- Get to the point: summarization with pointer-generator networks. In ACL. Cited by: §6.2, Table 2.
- Patent retrieval: a literature review. Knowledge and Information Systems, pp. 1–30. Cited by: §1.
- BIGPATENT: a large-scale dataset for abstractive and coherent summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2204–2213. Cited by: §3.1, §4, §6.2, item Data, data, data.
- Natural language analysis of patent claims. In Proceedings of the ACL-2003 Workshop on Patent Corpus Processing - Volume 20, PATENT ’03, USA, pp. 66–73. Cited by: §7.
- Automatic text simplification for handling intellectual property (the case of multiple patent claims). In Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014), Dublin, Ireland, pp. 41–52. Cited by: Figure 3, §7.
- Aligning patent claims with detailed descriptions for readability. In NTCIR. Cited by: item Textual vs. graphical output, §7.
- Rhetorical structure analysis of Japanese patent claims using cue phrases. In NTCIR. Cited by: §7.
- Patent claim processing for readability: structure analysis and term explanation. In Proceedings of the ACL-2003 Workshop on Patent Corpus Processing - Volume 20, PATENT ’03, USA, pp. 56–65. Cited by: item Long sentences, §7.
- Fact-enhanced synthetic news generation. Conference on Artificial Intelligence, AAAI. Cited by: §3.3.
- A survey on text simplification. Cited by: §3.2.
- Using summarization techniques on patent database through computational intelligence. In Progress in Artificial Intelligence, P. Moura Oliveira, P. Novais, and L. P. Reis (Eds.), Cham, pp. 508–519. Cited by: §6.1.
- User study for measuring linguistic complexity and its reduction by technology on a patent website. In 34th International Conference on Machine Learning, ICML’17. Cited by: §3.2, §5.
- Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 3104–3112. Cited by: §6.2, Table 2.
- Automated patent document summarization for R&D intellectual property management. 2006 10th International Conference on Computer Supported Cooperative Work in Design, pp. 1–6. Cited by: item 2, item 3, item 4.
- A semantic based approach for automatic patent document summarization. In Collaborative Product and Service Life Cycle Management for a Sustainable World, R. Curran, S. Chou, and A. Trappey (Eds.), London, pp. 485–494. Cited by: item 2, item 3, item 4, item 5.
- Intelligent compilation of patent summaries using machine learning and natural language processing techniques. Advanced Engineering Informatics 43, pp. 101027. Cited by: §6.3.
- Automatic patent document summarization for collaborative knowledge systems and services. Journal of Systems Science and Systems Engineering 18, pp. 71–94. Cited by: item 3, item 5.
- An R&D knowledge management method for patent document. Industrial Management and Data Systems 108, pp. 245–257. Cited by: item 4.
- Text mining techniques for patent analysis. Information Processing & Management 43 (5), pp. 1216–1247. Note: Patent Processing. Cited by: item Human- vs. machine-focused, item 3.
- Patent surrogate extraction and evaluation in the context of patent mapping. Journal of Information Science 33 (6), pp. 718–736. Cited by: item Human- vs. machine-focused.
- Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §6.2.
- Quantifying the challenges in parsing patent claims. In Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval at ECIR 2010, pp. 14–21. Cited by: item Long sentences, item Words’ distribution and vocabulary.
- Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5008–5020. Cited by: item Evaluation metrics.
- Towards content-oriented patent document processing. World Patent Information 30, pp. 21–33. Cited by: §7.
- Big bird: transformers for longer sequences. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 17283–17297. Cited by: §6.2, Table 2.
- PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 11328–11339. Cited by: §6.2, Table 2.
- BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations. Cited by: item Evaluation metrics.