A patent is the exclusive right to manufacture, use, or sell an invention and is granted by the government’s patent offices. For a patent to be granted, it is indispensable that the described invention is not known or easily inferred from the so-called prior art, where prior art includes any written or oral publication available before the filing date of the submission. Therefore, for each application that is submitted, the responsible patent office performs a search for related work to check if the subject matter described in the submission is inventive enough to be patentable. Before handing in the application to the patent office, the inventors will usually consult a patent attorney, who represents them in obtaining the patent. In order to assess the chances of the patent being granted, the patent attorney often also performs a search for prior art.
When searching for prior art, patent officers and patent attorneys currently rely mainly on simple keyword searches such as those implemented by the Espacenet tool from the European Patent Office, the TotalPatent software developed by LexisNexis, or the PatSnap patent search, all of which provide very limited semantic search options. These search engines often fail to return relevant documents, and because of constraints on the length of the entered search text, it is usually not possible to use a patent application’s entire text for the search; instead, the database can only be queried for specific keywords.
Current search approaches for prior art therefore require a significant amount of manual work and time, as given a patent application, the patent officer or attorney has to manually formulate a search query by combining words that should match documents describing similar inventions. Furthermore, these queries often have to be adapted several times to optimize the output of the search [18, 67]. A main problem here is that regular keyword searches do not inherently take into account synonyms or more abstract terms related to the given query words. This means that if, for an important term in the patent application, a synonym (such as wire instead of cable) or a more specialized term (such as needle instead of sharp object) has been used in an existing document of prior art, a keyword search might fail to reveal this relation unless the alternative term was explicitly included in the search query. This is relevant as it is quite common in patent texts to use very abstract and general terms for describing an invention in order to maximize the protective scope [64, 5]. A line of research [23, 32, 34, 31, 61] has focused on automatically expanding the manually composed queries, e.g., to take into account synonyms collected in a thesaurus [37, 35] or include keywords occurring in related patent documents [16, 39, 40]. Yet, with iteratively augmented queries, be it by manual or automatic extension of the query, the search for prior art remains a very time-consuming process.
Furthermore, a keyword-based search for prior art, even if done with the most professional care, will often produce suboptimal results (as we will see, e.g., later in this paper and in Supporting Information D.2). With possibly imperfect queries, it must be assumed that relevant documents are missed in the search, leading to false negatives (FN). On the other hand, query words can also appear in texts that, nonetheless, have quite different topics, which means the search will additionally yield many false positives (FP). When searching for prior art for a patent application, the consequences of false positives and false negatives are quite different. While false positives cause additional work for the patent examiner, who has to exclude the irrelevant documents from the report, false negatives may lead to an erroneous grant of a patent, which can have profound legal and financial implications for both the owner of said patent as well as competitors.
1.1 An approach to automate the search for prior art
To overcome some of these disadvantageous aspects of current keyword-based search approaches, it is necessary to decrease the manual work and time required for conducting the search itself, while increasing the quality of the search results by avoiding irrelevant patents from being returned, as well as automatically accounting for synonyms to reduce false negatives. This can be achieved by comparing the patent application with existing publications based on their entire texts rather than just searching for specific keywords. By considering the entire texts of the documents, much more information, including the context of keywords used within the respective documents, is taken into account. For humans it is of course infeasible to read the whole text of each possibly relevant document. Instead, state-of-the-art text processing techniques can be used for this task.
This paper describes a novel approach to automate the search for prior art with natural language processing (NLP) and machine learning (ML) techniques in order to make it more efficient and accurate. The essence of this idea is illustrated in Fig 1. We first obtain a dataset of related patents from a patent database by using a few seed patents and then recursively adding the patents or patent applications that are cited by the documents already included in the dataset. The patent texts are then transformed into numerical feature vectors, based on which the similarity between two documents can be computed. We evaluate different similarity measures by comparing the documents that our automated approach considers as being very similar to some patent to those documents that were originally cited in this patent’s search report and, in a second step, to documents considered relevant for this patent by a patent attorney.
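The corpus construction described above (start from seed patents, then keep adding cited documents) can be sketched as a graph traversal. This is a minimal illustration, not the authors' crawler: the function `collect_corpus`, the callback `get_cited_ids`, the `max_docs` cap, and the toy citation graph are all hypothetical, and a breadth-first traversal is assumed here in place of whatever recursion scheme was actually used.

```python
from collections import deque


def collect_corpus(seed_ids, get_cited_ids, max_docs=1000):
    """Expand a set of seed patents by following citation links.

    get_cited_ids is a hypothetical callback that returns the IDs cited
    by a given patent (e.g. a database lookup); max_docs caps the size.
    """
    corpus = set(seed_ids)
    queue = deque(seed_ids)
    while queue and len(corpus) < max_docs:
        patent_id = queue.popleft()
        for cited_id in get_cited_ids(patent_id):
            # Only enqueue documents we have not seen yet.
            if cited_id not in corpus and len(corpus) < max_docs:
                corpus.add(cited_id)
                queue.append(cited_id)
    return corpus


# Toy citation graph standing in for a real patent database.
toy_citations = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["A"]}
corpus = collect_corpus(["A"], lambda pid: toy_citations.get(pid, []))
```

The visited set doubles as the corpus, so citation cycles (here D citing A) do not cause repeated work.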
The remainder of the paper is structured as follows: After briefly reviewing existing strategies for prior art search as well as machine learning methods for full text similarity search and its applications, we discuss our approach for computing the similarities between the patents using different feature extraction methods. These methods are then evaluated on an example corpus of patents including their citations, as well as a second corpus where relevant patents were identified by a patent attorney. Furthermore, we assess the quality of the original citation process itself based on both corpora. A discussion of the relevance of the obtained results and a brief outlook conclude this manuscript.
1.2 Related work
Most research concerned with facilitating and improving the search for a patent’s prior art has focused on automatically composing and extending the search queries. For example, a manually formulated query can be improved by automatically including synonyms for the keywords using a thesaurus [37, 64, 35, 38, 71]. A potential drawback of such an approach, however, is that the thesaurus itself has to be manually curated and extended. Another line of research focuses on pseudo-relevance feedback, where, given an initial search, the first search results are used to identify additional keywords that can be used to extend the original query [39, 17, 18]. Similarly, past queries or meta data such as citations can be used to augment the search query [16, 40, 41]. A recent study has also examined the possibility of using the word2vec language model [45, 46, 47] to automatically identify relevant words in the search results that can be used to extend the query.
Approaches for automatically adapting and extending queries still require the patent examiner to manually formulate the initial search query. To make this step obsolete, heuristics can be used to automatically extract keywords from a given patent application [42, 24, 69] or a bag-of-words (BOW) approach can be used to transform the entire text of a patent into a list of words that can then be used to search for its prior art [68, 11, 72]. Often, partial patent applications, such as an extended abstract, may already suffice to conduct the search. The search results can also be further refined with a graph-based ranking model or by using the patents’ categories to filter the results. Different prior art search approaches have previously been discussed and benchmarked within the CLEF project.
In our approach, detailed in the following sections, we also alleviate the required work and time needed to manually compose a search query by simply operating on the patent application’s entire text. However, instead of only searching the database for relevant keywords extracted from this text, we transform the texts of all other documents into numerical feature representations as well, which allow us to compute the full text similarities between the patent application and its possible prior art.
Calculating the similarity between texts is at the heart of a wide range of information retrieval tasks, such as search engine development, question answering, document clustering, or corpus visualization. Approaches for computing text similarities can be divided into similarity measures relying on word similarities and those based on document feature vectors.
To compute the similarity between two texts using individual word similarities, the words in both texts first have to be aligned by creating word pairs based on semantic similarity and then these similarity scores are combined to yield a similarity measure for the whole text. Corley and Mihalcea propose a text similarity measure, where the most similar word pairs in two texts are determined based on semantic word similarity measures as implemented in the WordNet similarity package. The similarity score of two texts is then computed as the weighted and normalized sum of the single word pairs’ similarity scores. This approach can be further refined using greedy pairing. More recently, instead of using WordNet relations to obtain word similarities, the similarity between semantically meaningful word embeddings, such as those created by the word2vec language model, has been used. Kusner et al. defined the word mover’s distance for computing the similarity between two sentences as the minimum distance the individual word embeddings have to move to match those of the other sentence. While similarity measures based on the semantic similarities of individual words are advantageous when comparing short texts, finding an optimal word pairing for longer texts is computationally very expensive and therefore these similarity measures are less practical in our setting, where the full texts of whole documents have to be compared.
To compute the similarity between longer documents, these can be transformed into numerical feature vectors, which serve as input to a similarity function. Rieck and Laskov give a comprehensive overview of similarity measures for sequential data, some of which are widely used in information retrieval applications. Achananuparp et al. test some of these similarity measures for comparing sentences on three corpora, using accuracy, precision, recall, and rejection as metrics to evaluate how many of the retrieved documents are relevant in relation to the number of relevant documents missed. Huang uses several of these similarity measures to perform text clustering on tf-idf vectors. Interested in how well similarity measures reproduce human similarity ratings, Lee et al. create a text similarity corpus based on all possible pairs of 50 different documents rated by 83 students. They test different feature extraction methods in combination with four of the similarity measures described in Rieck and Laskov and calculate the correlation of the human ratings with the resulting scoring. They conclude that using the cosine similarity, high precision can be achieved, while recall is still not satisfying.
Full text similarity measures have previously been used to improve search results for MEDLINE articles, where a two step approach using the cosine similarity measure between tf-idf vectors in combination with a sentence alignment algorithm yielded superior results compared to the boolean search strategy used by PubMed. The Science Concierge computes the similarities between papers’ abstracts to provide content based recommendations; however, it still requires an initial keyword search to retrieve articles of interest. The PubVis web application by Horn, developed for visually exploring scientific corpora, also provides recommendations for similar articles given a submitted abstract by measuring overlapping terms in the document feature vectors. While full text similarity search approaches have shown potential in domains such as scientific literature, only a few studies have explored this approach for the much harder task of retrieving prior art for a new patent application, where much less overlap between text documents is to be expected due to the usage of very abstract and general terms when describing new inventions. Specifically, document representations created using recently developed neural network language models such as word2vec [45, 46, 21] or doc2vec have not yet been evaluated on patent documents.
In order to study our hypothesis that the search for prior art can be improved by automatically determining, for a given patent application, the most similar documents contained in the database based on their full texts, we need to evaluate multiple approaches for comparing the patents’ full texts and computing similarities between the documents. To do this, we test multiple approaches for creating numerical feature representations from the documents’ raw texts, which can then be used as input to a similarity function to compute the documents’ similarity.
All raw documents first have to be preprocessed by lower casing and removing non-alphanumeric characters.
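A minimal sketch of this preprocessing step follows. The text only states that documents are lower-cased and stripped of non-alphanumeric characters; replacing those characters with spaces, restricting to ASCII letters and digits, and splitting on whitespace are assumptions made here for illustration, and `preprocess` is a hypothetical helper name.

```python
import re


def preprocess(text):
    """Lower-case a raw document and drop non-alphanumeric characters.

    Assumption: runs of non-alphanumeric characters become a single
    space, and the result is split into whitespace-separated tokens.
    """
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)
    return cleaned.split()


tokens = preprocess("A Bone-Anchoring Member (Fig. 3)!")
# tokens == ['a', 'bone', 'anchoring', 'member', 'fig', '3']
```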
The simplest way of transforming texts into numerical vectors is to create high dimensional but sparse bag-of-words (BOW) vectors with tf-idf features. These BOW representations can also be reduced to their most expressive dimensions using dimensionality reduction methods such as latent semantic analysis (LSA) [26, 48] or kernel principal component analysis (KPCA) [60, 49, 58, 59]. Alternatively, the neural network language models (NNLM) word2vec [45, 46] (combined with BOW vectors) or doc2vec can be used to transform the documents into feature vectors. All these feature representations are described in detail in the Supporting Information A.1.
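To make the tf-idf BOW representation concrete, the following sketch builds sparse vectors (as dictionaries) from tokenized documents using only the standard library. The exact weighting used in the paper is given in its Supporting Information A.1 and is not reproduced here; the idf formula `log(N / df) + 1` and the length normalization are assumptions chosen for illustration, and `tfidf_vectors` is a hypothetical name. LSA, KPCA, word2vec, and doc2vec are not covered by this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Turn tokenized documents into sparse, length-normalized tf-idf dicts."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    # Assumed idf variant: rare terms get a larger weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vec = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Scale to unit length so long and short documents are comparable.
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else {})
    return vectors


docs = [["cable", "anchor"], ["wire", "anchor"], ["needle"]]
vectors = tfidf_vectors(docs)
```

In this toy corpus, "cable" occurs in one document and "anchor" in two, so "cable" receives the larger weight in the first vector.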
Using any of these feature representations, the pairwise similarity between two documents’ feature vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ can be calculated using the cosine similarity:
$$\mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i^\top \mathbf{x}_j}{\|\mathbf{x}_i\| \, \|\mathbf{x}_j\|},$$
which is $1$ for documents that are (almost) identical, and $0$ (in the case of non-negative BOW feature vectors) or below $0$ for unrelated documents [13, 22, 8]. Other possible similarity functions for comparing sequential data [56, 51] are discussed in the Supporting Information A.2.
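The cosine similarity can be computed directly on sparse feature vectors. The sketch below assumes each document is represented as a dictionary mapping terms to weights; this representation and the name `cosine_similarity` are illustrative choices, not the authors' implementation. Returning 0 for an all-zero vector is likewise a convention chosen here.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as dicts."""
    # Iterate over the shorter vector; only shared terms contribute.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty documents
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity({"a": 1.0, "b": 2.0}, {"a": 2.0, "b": 4.0})
no_overlap = cosine_similarity({"cable": 1.0}, {"needle": 1.0})
```

Two vectors pointing in the same direction score 1 regardless of their length, while vectors without shared terms score 0, matching the interpretation given in the text for non-negative BOW features.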
Our experiments are conducted on two datasets, created using a multi-step process as briefly outlined here and further discussed in the Supporting Information B. For ease of notation, we use the term patent when really referring to either a granted patent or a patent application.
We first obtained a patent corpus containing more than 100,000 patent documents from the Cooperative Patent Classification scheme (CPC) category A61 (medical or veterinary science and hygiene), published between 2000 and 2015. From this corpus we selected altogether 28,381 documents for our first dataset: The roughly 2,500 patents published in 2015 constitute our set of “target patents”. Each target patent cites on average 17.5 (± 28.4) other patents in our corpus (i.e. published after 2000). For each target patent, we selected the set of patents that are cited in its search report. Additionally, we randomly selected another 1,000 patents from the corpus, which were not cited by any of the selected target patents. All target patents are then paired up with their respective cited patents, as well as the 1,000 random patents. Each pair is then either assigned the label cited, if the second patent is cited in the search report of the target patent, or is labelled as random otherwise. This marks our first dataset consisting of 2,470,736 patent pairs with a ‘cited/random’ labelling. The patent documents in this dataset contain on average 13,530 (± 18,750) words.
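The pairing and labelling scheme for the first dataset can be sketched as follows. The function `build_labelled_pairs`, its argument names, and the toy patent IDs are hypothetical; the sketch only mirrors the rule stated above, namely that each target patent is paired with its own cited patents (label cited) and with every randomly drawn patent (label random).

```python
def build_labelled_pairs(search_reports, random_ids):
    """Create (target, other) pairs with a 'cited' or 'random' label.

    search_reports maps each target patent ID to the IDs cited in its
    search report; random_ids are patents cited by none of the targets.
    """
    pairs = {}
    for target_id, cited_ids in search_reports.items():
        for other_id in cited_ids:
            pairs[(target_id, other_id)] = "cited"
        for other_id in random_ids:
            # setdefault keeps an existing 'cited' label if IDs overlap.
            pairs.setdefault((target_id, other_id), "random")
    return pairs


# Toy data with made-up IDs, for illustration only.
toy_reports = {"T1": ["P1", "P2"], "T2": ["P2"]}
toy_pairs = build_labelled_pairs(toy_reports, ["R1", "R2"])
```

Note that a target is not paired with patents cited only by other targets: in the toy data, ("T2", "P1") is not part of the dataset.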
The second dataset is created by obtaining additional, more consistent human labels from a patent attorney for a small subset of the first dataset. These labels should show which of the cited patents are truly relevant to the target patent and whether important prior art is missing from the search reports. For ten patents, we selected their respective cited patents as well as several random patents that either obtained a high, medium, or low similarity score as computed with the cosine similarity on tf-idf BOW features. These 450 patent pairs were then manually assigned ‘relevant/irrelevant’ labels and constitute our second dataset.
A pair of patents should have a high similarity score if the two texts address a similar or almost identical subject matter, and a low score if they are unrelated. Furthermore, if two patent documents address a similar subject matter, then one document of said pair should have been cited in the search report of the other. To evaluate the similarity computation with different feature representations, the task of finding similar patents can be modelled as a classification problem, where the samples correspond to pairs of patents. A patent pair is given a positive label if one of the patents was cited by the other, and a negative label otherwise. We can then compute similarity scores for all pairs of patents and select a threshold for the score, where all patent pairs with a similarity score higher than this threshold are considered relevant for each other, while similarity scores below the threshold indicate that the patents in the pair are unrelated. With a meaningful similarity measure, it should be possible to choose a threshold such that most patent pairs associated with a positive label have a similarity score above the threshold and the pairs with negative labels score below the threshold. For a given threshold, we can compute the true positive rate (TPR), also called recall, and the false positive rate (FPR) of the classifier. By plotting the TPR against the FPR for different decision thresholds, we obtain the graph of the receiver operating characteristic (ROC) curve, where the area under the ROC curve (AUC) conveniently translates the performance of the classifier into a number between 0.5 (no separation between classes) and 1 (clear distinction between positive and negative samples). Further details on this performance measure can be found in the Supporting Information C.
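The evaluation as a thresholded classifier can be sketched as follows. `tpr_fpr` computes the two rates for one threshold, and `roc_auc` uses the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair (ties counted as one half). Both function names and the toy scores are hypothetical, and the quadratic loop is chosen for readability, not for a dataset with millions of pairs.

```python
def tpr_fpr(scores, labels, threshold):
    """True and false positive rate when predicting 'positive' above threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    return tp / n_pos, fp / n_neg


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))


# Toy similarity scores; label 1 = cited, label 0 = random.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 0]
auc = roc_auc(toy_scores, toy_labels)  # 5 of 6 pairs ordered correctly
tpr, fpr = tpr_fpr(toy_scores, toy_labels, threshold=0.35)
```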
While the AUC is a very useful measure to select a similarity function based on which relevant and irrelevant patents can be reliably separated, the exact score also depends on characteristics of the dataset and may therefore seem overly optimistic. Especially in our first dataset, many of the randomly selected patents contain little overlap with the target patents and can therefore be easily identified as irrelevant. With only a small fraction of the random pairs receiving a medium or high similarity score, this means that for most threshold values the FPR will be very low, resulting in larger AUC values. To give a further perspective on the performance of the compared similarity measures, we therefore additionally report the average precision (AP) score for the final results. For a specific threshold, precision is defined as the number of TP relative to the number of all returned documents, i.e., TP+FP. As we rank the patent pairs based on their similarity score, precision and recall can again be plotted against each other for different thresholds and the area under this curve can be computed as the weighted average of precision ($P$) and recall ($R$) for all threshold values $n$:
$$\mathrm{AP} = \sum_n (R_n - R_{n-1}) \, P_n$$
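A small sketch of the average precision computation: the pairs are ranked by decreasing similarity score, and at every rank the increase in recall is weighted by the precision reached at that rank. The function name `average_precision` and the toy data are hypothetical, and the sketch assumes distinct scores (tied scores are simply processed in sort order, which a production implementation would handle more carefully).

```python
def average_precision(scores, labels):
    """Sum of recall increments weighted by precision, over the ranking."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ap = 0.0
    true_positives = 0
    prev_recall = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / rank
        recall = true_positives / n_pos
        # Recall only changes at positive samples, so negatives add 0.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy scores; label 1 = cited/relevant, label 0 = random/irrelevant.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 0]
ap = average_precision(toy_scores, toy_labels)  # 0.5 * 1 + 0.5 * (2/3)
```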
The aim of our study is to identify a robust approach for computing the full text similarity between two patents. To this end, in the following we evaluate different document feature representations and similarity functions by assessing how well the computed similarity scores are aligned with the labels of our two datasets, i.e., whether a high similarity score is assigned to pairs that are labelled as cited (relevant) and low similarity scores to random (irrelevant) pairs. Furthermore, we examine the discrepancies between patents cited in a patent application’s search report and truly relevant prior art. The data and code to replicate the experiments are available online at https://github.com/helmersl/patent_similarity_search.
5.1 Using full text similarity to identify cited patents
The similarities between the patents in each pair contained in the cited/random dataset are computed using the different feature extraction methods together with the cosine similarity and the obtained similarity scores are then evaluated by computing the AUC with respect to the pairs’ labels (Table 1). The similarity scores are computed using either the full texts of the patents to create the feature vectors, or only parts of the documents, such as the patents’ abstracts or their claims, to identify which sections are most relevant for this task [14, 11]. Additionally, the results on this dataset using BOW feature vectors together with other similarity measures can be found in the Supporting Information D.1.
|Features||patent section: AUC|
|BOW + word2vec||0.9410||0.8618||0.8525|
The BOW features outperform the tested dimensionality reduction methods LSA and KPCA as well as the NNLM word2vec and doc2vec when comparing the patents’ full texts (Table 1). Yet, with AUC values greater than 0.9, all methods succeed in identifying cited patents by assigning the patents found in a target patent’s search report a higher similarity score than those that they were paired up with randomly. When only certain patent sections are taken into account, the NNLMs perform as well as (word2vec) or even better than (doc2vec) the BOW vectors, and LSA performs well on the claims section as well. The comparably good performance, especially of doc2vec, on individual sections is probably due to the fact that these feature representations are more meaningful when computed for shorter texts, whereas when combining the embedding vectors of too many individual words, the resulting document representation can be rather noisy.
When looking more closely at the score distributions obtained with BOW features on the patents’ full texts as well as their claims sections (Fig 2), it can be seen that when only using the claims sections, the scores of the duplicate patent pairs, instead of being clustered near 1, range nearly uniformly between 0 and 1. This can be explained by divisional applications and the fact that during the different stages of a submission process, most of the time only the claims section is revised (usually by weakening the claims), such that several versions of a patent application will essentially differ from each other only in their claims whereas abstract and description remain largely unchanged [72, 11].
5.2 Identifying truly relevant patents
The search for prior art for a given patent application is in general conducted by a single person using mainly keyword searches, which might result in false positives as well as false negatives. Furthermore, as different patent applications are handled by different patent examiners, it is difficult to obtain a consistently labelled dataset. A more reliably labelled dataset would therefore be desirable to properly evaluate our automatic search approach. In the previous section, we showed that by computing the cosine similarity between feature vectors created from full patent texts we can identify patents that occur in the search report of a target patent. However, the question remains, whether these results translate to a real setting and if it is possible to find patents previously overlooked or prevent the citation of actually irrelevant patents.
To get an estimate of how many of the cited, as well as the patents identified through our automated approach, are truly relevant for a given target patent, we asked a patent attorney to label a small subsample of the first dataset. As the patent attorney labelled these patents very carefully, her decisions merit a high confidence and we therefore consider them as the ground truth when her ratings are in conflict with the citation labels.
Using this second, more reliably labelled dataset, we first assess the amount of (dis)agreement between the cited/random labelling, based on the search reports, and the relevant/irrelevant labelling, obtained from the patent attorney. We then evaluate the similarity scores computed for this second dataset to see whether our automated approach is indeed capable of identifying the truly relevant prior art for a new patent application.
Comparing the current citation process to the additional human labels
To see if documents found in the search for prior art conducted by the patent office generally coincide with the documents considered relevant by our patent attorney, the confusion matrix as well as the correlation between the two human labellings is analysed. Please keep in mind that, in general, patent examiners can only assess the relevance of prior art that was actually found by the keyword driven search.
Taking the relevant/irrelevant labelling as the ground truth, the confusion matrix (Table 2) shows that 86 FP and 18 FN are produced by the patent examiner, which results in an FPR of 23% and an FNR of 22%. The large number of false positives can, in part, be explained by applicants being required by the USPTO to file so-called information disclosure statements (IDS) including, according to the applicant, related background art. The documents cited in an IDS are then included in the list of citations by the examiner, thus resulting in very long citation lists.
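How the two error rates follow from a confusion matrix can be sketched as below, treating the attorney's relevant/irrelevant labels as ground truth and the examiner's cited/random labels as predictions. The function `confusion_rates` and the toy label lists are hypothetical; the toy data does not reproduce the counts of Table 2.

```python
def confusion_rates(examiner_cited, attorney_relevant):
    """Confusion matrix counts plus FPR = FP/(FP+TN) and FNR = FN/(FN+TP)."""
    tp = fp = tn = fn = 0
    for cited, relevant in zip(examiner_cited, attorney_relevant):
        if cited and relevant:
            tp += 1  # cited and truly relevant
        elif cited and not relevant:
            fp += 1  # cited although irrelevant
        elif not cited and relevant:
            fn += 1  # relevant prior art that was missed
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "FPR": fpr, "FNR": fnr}


# Toy labels (1 = cited / relevant, 0 = random / irrelevant).
toy_cited = [1, 1, 1, 0, 0, 0, 0, 1]
toy_relevant = [1, 0, 1, 0, 1, 0, 0, 0]
rates = confusion_rates(toy_cited, toy_relevant)
```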
To get a better understanding of the relationship between the cosine similarity computed using BOW feature vectors and the relevant/irrelevant as well as the cited/random labelling, we calculate their pairwise correlations using Spearman’s ρ (Table 3). The highest correlation score of 0.652 is reached between the relevant/irrelevant labelling and the cosine similarity, whereas Spearman’s ρ for the cosine similarity and the cited/random labels is much lower (0.501).
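Spearman's rank correlation between a continuous similarity score and a binary labelling can be sketched as the Pearson correlation of the two rank vectors, with tied values (unavoidable for binary labels) receiving their average rank. The helper names and the toy values below are hypothetical and only illustrate the computation, not the numbers in Table 3.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy similarity scores and binary relevance labels.
rho = spearman_rho([0.1, 0.4, 0.35, 0.8], [0, 1, 0, 1])
```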
When plotting the cosine similarity and the relevant/irrelevant labelling against each other for individual patents (e.g. Fig 3), in most cases, the scorings agree on whether a patent is relevant or not for the target patent. Yet it is worthwhile to inspect some of the outliers to get a better understanding of the process. In the Supporting Information D.2 we discuss two false positives, one produced by our approach and one found in a patent’s search report. More problematic, however, are false negatives, i.e., prior art that was missed when filing the application. For the target patent with ID US20150018885 our automated approach would have discovered a relevant patent, which was missed by the search performed by the patent examiner (Fig 3).
The patent with ID US20110087291 must be considered as relevant for the target patent, because both describe rigid bars that are aimed at connecting vertebrae for stabilization purposes with two anchors that are screwed into the bones. While in the target patent, the term bone anchoring member is used, the same part of the device in patent US20110087291 is called connecting member, which is a more abstract term. Moreover, instead of talking about a connecting bar, as it is done in the target patent, the term elongate fusion member is used in the other patent application.
Using full text similarity to identify relevant patents
In order to systematically assess how close the similarity score ranking can get to the one of the patent attorney (relevant/irrelevant) compared to the one of the patent office examiners (cited/random), the experiments performed on the first dataset with respect to the cited/random labelling were again conducted on this dataset subsample. For the analysis, it is important to bear in mind that this dataset is different from the one used in the previous experiments, as it only consists of the 450 patent pairs scored by the patent attorney. For each of the feature extraction methods, it was assessed how well the cosine similarity could distinguish between the relevant and irrelevant as well as the cited and random patent pairs of this smaller dataset.
The AUC and AP values achieved with the different feature representations on both labellings as well as, for comparison, on the original dataset, are reported in Table 4.
|BOW + word2vec||0.8408||0.8544||0.9410||0.5443||0.7354||0.4019|
On this dataset subsample, the AUC w.r.t. the cited/random labelling is much lower than in the previous experiment on the larger dataset (0.806 compared to 0.956 for BOW features), which can be in part explained by the varying number of easily identifiable negative samples and their impact on the FPR: The full cited/random dataset contains many more low-scored random patents than the relevant/irrelevant subsample, where we included an equal amount of low- and high-scored random patents for each of the ten target patents. Yet, for almost all feature representations, the performance is better for the relevant/irrelevant than for the cited/random labelling of the dataset subsample, and the best results on the relevant/irrelevant labelling are achieved using the combination of BOW vectors and word2vec embeddings as feature vectors.
The search for prior art for a given patent application is currently based on a manually conducted keyword search, which is not only time consuming but also prone to mistakes yielding both false positives and, more problematically, false negatives. In this paper, an approach for automating the search for prior art was developed, where a patent application’s full text is automatically compared to the patents contained in a database, yielding a similarity score based on which the patents can be ranked from most similar to least similar. The patents whose similarity scores exceed a certain threshold can then be suggested as prior art.
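The final retrieval step summarized above, ranking database patents by similarity to the application and suggesting those above a threshold, can be sketched as follows. The function `suggest_prior_art`, the default threshold of 0.5, the optional `top_k` cut-off, and the toy scores are all hypothetical; a suitable threshold would have to be chosen from the evaluation, for example from the ROC or precision-recall curve.

```python
def suggest_prior_art(similarities, threshold=0.5, top_k=None):
    """Rank candidate patents by similarity and keep those above threshold.

    similarities maps a candidate patent ID to its similarity with the
    application under examination.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    suggested = [(pid, score) for pid, score in ranked if score > threshold]
    return suggested if top_k is None else suggested[:top_k]


# Toy similarity scores with made-up candidate IDs.
toy_similarities = {"P1": 0.82, "P2": 0.15, "P3": 0.64, "P4": 0.5}
suggestions = suggest_prior_art(toy_similarities, threshold=0.5)
```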
Several feature extraction methods for transforming documents into numerical vectors were evaluated on a dataset consisting of real world patent documents. In a first step, the evaluation was performed with respect to the distinction between cited and random patents, where cited patents are those included in the given target patent’s search report and random patents are randomly selected patent documents that were not cited by any of the target patents. We showed that by computing the cosine similarity between feature vectors created from full patent texts, we can reliably identify patents that occur in the search report of a target patent. The best distinction between these cited and random patents on the full corpus could be achieved when computing the cosine similarity using the well-established tf-idf BOW features, which is conceptually the method most closely related to a regular keyword search.
To examine the discrepancies between the computed similarity scores and cited/random labels, we obtained additional and more reliable labels from a patent attorney to identify truly relevant patents. As illustrated by Tables 3 and 4, the automatically calculated similarities between patents are closer to the patent attorney’s relevancy scoring than to the cited/random labellings obtained from the search report. The comparison of different feature representations on the smaller dataset not only showed that the same feature extraction method reaches different AUCs for the two labellings, but also that the feature extraction method that best distinguishes between cited and random patents on the full corpus (BOW) was outperformed on the relevant/irrelevant dataset by the combination of tf-idf BOW feature vectors with word2vec embeddings. This again indicates that the keyword search is missing patents that use synonyms or more general and abstract terms, which can be identified using the semantically meaningful representations learned by an NNLM. Therefore, with our automated similarity search, we are able to identify the truly relevant documents for a given patent application.
Most importantly, we gave an example where the cosine similarity caught a relevant patent originally missed by the patent examiner (Fig 3). As discussed at the beginning of this paper, missing a relevant prior art document in the search is a serious issue, as this might lead to an erroneous grant of a patent with profound legal and financial implications for both the applicant and competitors.
Consequently, our findings show that the search for prior art for a given patent application, and thereby the citation process, can be greatly enhanced by a precursory similarity scoring of the patents based on their full texts. With our NLP based approach we would not only greatly accelerate the search process, but, as shown in our empirical analysis, our method could also improve the quality of the results by reducing the number of omitted yet relevant documents.
Given the so far unsatisfying FPR (23%) and FNR (22%) of the standard citation process compared to the relevancy labellings provided by our patent attorney, in the future it is clearly desirable to focus on improving the separation of relevant and irrelevant instead of cited and random patents. Our results on the small relevant/irrelevant dataset, while very encouraging, should only be considered as a first indicative step; clearly the creation of a larger dataset, reliably labelled by several experts, will be an essential next step for any further evaluation.
Furthermore, the methods discussed within this paper should also be applied to documents from other CPC classes to assess the quality of the automatically generated search results in domains other than medical or veterinary science and hygiene. Additionally considering the (sub)categories of the patents as features when conducting the search for prior art also seems like a promising step to further enhance the search results [70, 36].
It should also be evaluated how well these results translate to patents filed in other countries [53, 33], especially if these patents were automatically translated using machine translation methods [65, 15]. Here it may also be important to take a closer look at similarity search results obtained by using only the texts from single patent sections. As related work has shown [11, 14], an extended abstract and description may often suffice to find prior art. This can speed up the patent filing process, as all relevant prior art can already be identified early in the patent application process, thereby reducing the number of duplicate submissions with only revised (i.e. weakened) claims. However, as patents filed in different countries have different structures, these results might not directly translate to, e.g., patents filed with the European Patent Office.
It might also be of interest to compare other NNLM based feature representations for this task, e.g., by combining the word2vec embeddings with a convolutional neural network [6, 7]. To better adapt a similarity search approach to patents from other domains, it could also be advantageous to additionally take into account image based similarities computed from the sketches supplied in the patent documents [4, 31].
A further important challenge is how an exhaustive comparison of a given patent application to all the millions of documents contained in a real world patent database could be performed efficiently. Promising approaches for speeding up the similarity search for all pairs in a set should be explored for this task in future work.
The search for a patent’s prior art is a particularly difficult problem, as patent applications are purposefully written to create little overlap with other patents: only by distinguishing the invention from others does a patent application have a chance of being granted. Since our automated full text similarity search approach successfully improves the search for a patent’s prior art, these methods are also promising candidates for enhancing other document searches, such as identifying relevant scientific literature.
This work was supported by the Federal Ministry of Education and Research (BMBF) for the Berlin Big Data Center BBDC (01IS14013A) and the Institute for Information & Communications Technology Promotion, funded by the Korea government (MSIT) (No. 2017-0-00451, No. 2017-0-01779).
Author contributions statement
FH, FB, and KRM discussed and conceived the experiments, LH conducted the experiments, FB and TO labelled the subsample of the dataset. All authors wrote and reviewed the manuscript. Correspondence to LH, FH, and KRM.
- usp  Information Disclosure Statements, chapter 609. United States Patent and Trademark Office, 2018.
- Achakulvisut et al.  Titipat Achakulvisut, Daniel E Acuna, Tulakan Ruangrong, and Konrad Kording. Science concierge: A fast content-based recommendation system for scientific publications. PLOS ONE, 11(7):e0158423, 2016.
- Achananuparp et al.  Palakorn Achananuparp, Xiaohua Hu, and Xiajiong Shen. The evaluation of sentence similarity measures. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, DaWaK ’08, pages 305–316, Berlin, Heidelberg, 2008. Springer-Verlag.
- Alberts et al.  Doreen Alberts, Cynthia Barcelon Yang, Denise Fobare-DePonio, Ken Koubek, Suzanne Robins, Matthew Rodgers, Edlyn Simmons, and Dominic DeMarco. Introduction to Patent Searching, chapter 1, pages 3–45. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017. ISBN 978-3-662-53817-3.
- Andersson et al.  Linda Andersson, Allan Hanbury, and Andreas Rauber. The Portability of Three Types of Text Mining Techniques into the Patent Text Genre, chapter 9, pages 241–280. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017. ISBN 978-3-662-53817-3.
- Arras et al.  Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Explaining Predictions of Non-Linear Classifiers in NLP. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 1–7, Berlin, Germany, 8 2016. Association for Computational Linguistics.
- Arras et al.  Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. "what is relevant in a text document?": An interpretable machine learning approach. PloS one, 12(8):e0181142, 2017.
- Baeza-Yates and Ribeiro-Neto  Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
- Bayardo et al.  Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 131–140, New York, NY, USA, 2007. ACM.
- Bengio et al.  Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003. ISSN 1532-4435.
- Bouadjenek et al.  Mohamed Reda Bouadjenek, Scott Sanner, and Gabriela Ferraro. A study of query reformulation for patent prior art search with partial patent applications. In Proceedings of the 15th International Conference on Artificial Intelligence and Law, pages 23–32. ACM, 2015.
- Corley and Mihalcea  Courtney Corley and Rada Mihalcea. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, EMSEE ’05, pages 13–18, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.
- Crocetti  Giancarlo Crocetti. Textual spatial cosine similarity. CoRR, abs/1505.03934, 2015.
- D’hondt and Verberne  Eva D’hondt and Suzan Verberne. Clef-ip 2010: Prior art retrieval using the different sections in patent documents. In CLEF-IP 2010. Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), CLEF-IP workshop. Padua, Italy:[sn], 2010.
- Diallo and Lupu  Barrou Diallo and Mihai Lupu. Future Patent Search, chapter 17, pages 433–455. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017. ISBN 978-3-662-53817-3.
- Fujii  Atsushi Fujii. Enhancing patent retrieval by citation analysis. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 793–794. ACM, 2007.
- Ganguly et al.  Debasis Ganguly, Johannes Leveling, Walid Magdy, and Gareth JF Jones. Patent query reduction using pseudo relevance feedback. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 1953–1956. ACM, 2011.
- Golestan Far et al.  Mona Golestan Far, Scott Sanner, Mohamed Reda Bouadjenek, Gabriela Ferraro, and David Hawking. On term selection techniques for patent prior art search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 803–806. ACM, 2015.
- Gomaa and Fahmy  Wael H Gomaa and Aly A Fahmy. A survey of text similarity approaches. International Journal of Computer Applications, 68(13), 2013.
- Horn [2017a] Franziska Horn. Interactive exploration and discovery of scientific publications with pubvis. arXiv preprint arXiv:1706.08094, 2017a.
- Horn [2017b] Franziska Horn. Context encoders as a simple but powerful extension of word2vec. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 10–14. Association for Computational Linguistics, 2017b.
- Huang  Anna Huang. Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), pages 49–56, 2008.
- Kando and Leong  Noriko Kando and Mun-Kew Leong. Workshop on patent retrieval (sigir 2000 workshop report). In SIGIR Forum, volume 34, pages 28–30, 2000.
- Konishi  Kazuya Konishi. Query terms extraction from patent document for invalidity search. In Proceedings of NTCIR-5 Workshop Meeting, 2005-12, 2005.
- Kusner et al.  Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.
- Landauer et al.  Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse processes, 25:259–284, 1998.
- Le and Mikolov  Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196. JMLR Workshop and Conference Proceedings, 2014.
- Lee et al.  Michael D Lee, Brandon Pincombe, and Matthew Welsh. An empirical evaluation of models of text document similarity. In Proceedings of the Cognitive Science Society, volume 27, pages 1254–1259, 2005.
- Lewis et al.  James Lewis, Stephan Ossowski, Justin Hicks, Mounir Errami, and Harold R Garner. Text similarity: an alternative way to search medline. Bioinformatics, 22(18):2298–2304, 2006.
- Lintean and Rus  Mihai C Lintean and Vasile Rus. Measuring semantic similarity in short texts through greedy pairing and word semantics. In FLAIRS Conference, pages 244–249, 2012.
- Lupu and Hanbury  Mihai Lupu and Allan Hanbury. Patent retrieval. Foundations and Trends® in Information Retrieval, 7(1):1–97, 2013.
- Lupu et al.  Mihai Lupu, Katja Mayer, and Anthony J Trippe. Current Challenges in Patent Information Retrieval, volume 29. Springer, 2011.
- Lupu et al. [2017a] Mihai Lupu, Atsushi Fujii, Douglas W. Oard, Makoto Iwayama, and Noriko Kando. Patent-Related Tasks at NTCIR, chapter 3, pages 77–111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017a. ISBN 978-3-662-53817-3.
- Lupu et al. [2017b] Mihai Lupu, Katja Mayer, Noriko Kando, and Anthony J Trippe. Current Challenges in Patent Information Retrieval, volume 37. Springer, 2017b.
- Lupu et al. [2017c] Mihai Lupu, Florina Piroi, and Veronika Stefanov. An Introduction to Contemporary Search Technology, chapter 2, pages 47–73. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017c. ISBN 978-3-662-53817-3.
- Magali et al.  Mireles Magali, Gabriela Ferraro, and Geva Shlomo. Four patent classification problems in information management: A review of the literature and a determination of the four essential questions for future research. Information Research, 21(1):paper 705, 2016.
- Magdy and Jones  Walid Magdy and Gareth JF Jones. A study on query expansion methods for patent retrieval. In Proceedings of the 4th workshop on Patent information retrieval, pages 19–24. ACM, 2011.
- Magdy et al.  Walid Magdy, Johannes Leveling, and Gareth JF Jones. Exploring structured documents and query formulation techniques for patent retrieval. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 410–417. Springer, 2009.
- Mahdabi and Crestani  Parvaz Mahdabi and Fabio Crestani. Learning-based pseudo-relevance feedback for patent retrieval. In Information Retrieval Facility Conference, pages 1–11. Springer, 2012.
- Mahdabi and Crestani [2014a] Parvaz Mahdabi and Fabio Crestani. The effect of citation analysis on query expansion for patent retrieval. Information retrieval, 17(5-6):412–429, 2014a.
- Mahdabi and Crestani [2014b] Parvaz Mahdabi and Fabio Crestani. Query-driven mining of citation networks for patent citation retrieval and recommendation. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1659–1668. ACM, 2014b.
- Mahdabi et al.  Parvaz Mahdabi, Mostafa Keikha, Shima Gerani, Monica Landoni, and Fabio Crestani. Building queries for prior-art search. In Information Retrieval Facility Conference, pages 3–15. Springer, 2011.
- Manning et al.  Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
- Mihalcea and Tarau  Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, 2004.
- Mikolov et al. [2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
- Mikolov et al. [2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, pages 3111–3119, 2013b.
- Mikolov et al. [2013c] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013c.
- Moldovan et al.  Andreea Moldovan, Radu Ioan Boţ, and Gert Wanka. Latent semantic indexing for patent documents. International Journal of Applied Mathematics and Computer Science, 15:551–560, 2005.
- Müller et al.  Klaus-Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.
- Patwardhan et al.  Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing’03, pages 241–257, Berlin, Heidelberg, 2003. Springer-Verlag.
- Pele  Ofir Pele. Distance Functions: Theory, Algorithms and Applications. PhD thesis, The Hebrew University of Jerusalem, 2011.
- Piroi  Florina Piroi. Clef-ip 2010: Classification task evaluation summary. Technical report, Technical Report IRF TR 2010 00004, Information Retrieval Facility, Vienna, 2010.
- Piroi and Hanbury  Florina Piroi and Allan Hanbury. Evaluating Information Retrieval Systems on European Patent Data: The CLEF-IP Campaign, chapter 4, pages 113–142. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017. ISBN 978-3-662-53817-3.
- Piroi et al.  Florina Piroi, Mihai Lupu, and Allan Hanbury. Overview of clef-ip 2013 lab. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 232–249. Springer, 2013.
- Publication  WIPO Publication. WIPO Intellectual Property Handbook Second Edition, volume No. 489 (E). WIPO, 2004. ISBN 978-92-805-1291-5.
- Rieck and Laskov  Konrad Rieck and Pavel Laskov. Linear-time computation of similarity measures for sequential data. J. Mach. Learn. Res., 9:23–48, June 2008.
- Saito and Rehmsmeier  Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3):e0118432, 2015.
- Schölkopf and Smola  Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
- Schölkopf and Smola  Bernhard Schölkopf and Alexander J. Smola. A short introduction to learning with kernels. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures on Machine Learning: Machine Learning Summer School 2002 Canberra, Australia, February 11–22, 2002 Revised Lectures, pages 41–64, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
- Schölkopf et al.  Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
- Shalaby and Zadrozny  Walid Shalaby and Wlodek Zadrozny. Patent retrieval: A literature review. arXiv preprint arXiv:1701.00324, 2017.
- Singh and Sharan  Jagendra Singh and Aditi Sharan. Relevance feedback-based query expansion model using ranks combining and word2vec approach. IETE Journal of Research, 62(5):591–604, 2016.
- Tannebaum and Rauber  Wolfgang Tannebaum and Andreas Rauber. Using query logs of uspto patent examiners for automatic query expansion in patent searching. Information retrieval, 17(5-6):452–470, 2014.
- Tannebaum and Rauber  Wolfgang Tannebaum and Andreas Rauber. Patnet: a lexical database for the patent domain. In European Conference on Information Retrieval, pages 550–555. Springer, 2015.
- Tinsley  John Tinsley. Machine Translation and the Challenge of Patents, chapter 16, pages 409–431. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017. ISBN 978-3-662-53817-3.
- Trippe and Ruthven  Anthony Trippe and Ian Ruthven. Evaluating Real Patent Retrieval Effectiveness, chapter 5, pages 143–162. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017. ISBN 978-3-662-53817-3.
- Tseng et al.  Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin. Text mining techniques for patent analysis. Inf. Process. Manage., 43(5):1216–1247, 2007.
- Verberne and D’hondt  Suzan Verberne and Eva D’hondt. Prior art retrieval using the claims section as a bag of words. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 497–501. Springer, 2009.
- Verma and Varma [2011a] Manisha Verma and Vasudeva Varma. Applying key phrase extraction to aid invalidity search. In Proceedings of the 13th International Conference on Artificial Intelligence and Law, pages 249–255. ACM, 2011a.
- Verma and Varma [2011b] Manisha Verma and Vasudeva Varma. Exploring keyphrase extraction and ipc classification vectors for prior art search. In CLEF (Notebook Papers/Labs/Workshop), 2011b.
- Wang et al.  Feng Wang, Lanfen Lin, Shuai Yang, and Xiaowei Zhu. A semantic query expansion-based patent retrieval approach. In Fuzzy Systems and Knowledge Discovery (FSKD), 2013 10th International Conference on, pages 572–577. IEEE, 2013.
- Xue and Croft  Xiaobing Xue and W Bruce Croft. Transforming patents into prior-art queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 808–809. ACM, 2009.
- Zhang  Longhui Zhang. An Integrated Framework for Patent Analysis and Mining. FIU Electronic Theses and Dissertations, 2016.
- Zhu  Mu Zhu. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, 2:30, 2004.
Appendix A Supporting Information: Methods
A.1 Feature representations of text documents
Tf-Idf BOW features
Given $D$ documents with a vocabulary of size $L$, each text is transformed into a bag-of-words (BOW) feature vector $x_k \in \mathbb{R}^L$, $k = 1, \dots, D$, by first computing a normalized count, the term frequency (tf), for each word in a text, and then weighting this by the word’s inverse document frequency (idf) to reduce the influence of very frequent but inexpressive words that occur in almost all documents (such as ‘and’ and ‘the’). The idf of a term $t_i$ is calculated as the logarithm of the total number of documents, $D$, divided by the number of documents that contain term $t_i$, i.e.
$$\mathrm{idf}(t_i) = \log \frac{D}{|\{d_k : t_i \in d_k\}|}.$$
The entry corresponding to the word $t_i$ in the feature vector $x_k$ of a document $d_k$ is then
$$x_{ki} = \mathrm{tf}_k(t_i) \cdot \mathrm{idf}(t_i).$$
Instead of using the term frequency, a binary entry in the feature vector for each word occurring in the text might often suffice. Furthermore, the final tf-idf vectors can be normalized by dividing them e.g. by the maximum or the length of the respective vector:
$$x_k^{\max} = \frac{x_k}{\max_i x_{ki}} \qquad \text{or} \qquad x_k^{\mathrm{length}} = \frac{x_k}{\lVert x_k \rVert}.$$
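The tf-idf weighting and length normalization described above can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and a term frequency normalized by document length; the function name `tfidf_vectors` and the toy documents are made up for illustration and are not the preprocessing used in the experiments.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, list of length-normalized tf-idf vectors)."""
    # Assumption: lowercase whitespace tokenization, no stemming or stop words
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n_docs / df[word]) for word in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency: raw count normalized by the document length
        vec = [counts[w] / len(tokens) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        # Length normalization; an all-zero vector is left unchanged
        vectors.append([v / norm for v in vec] if norm > 0 else vec)
    return vocab, vectors


docs = [
    "the cable connects the device",
    "the wire connects the sensor",
    "the needle is a sharp object",
]
vocab, X = tfidf_vectors(docs)
```

Since ‘the’ occurs in every toy document, its idf is log(1) = 0 and it drops out of all vectors, which is the intended effect of the weighting.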
LSA and KPCA
Transforming the documents in the corpus into BOW vectors leads to a high-dimensional but sparse feature matrix. These feature representations can be reduced to their most expressive dimensions, which helps to reduce noise in the data and create more overlap between vectors. For this, we experiment with both latent semantic analysis (LSA) and kernel principal component analysis (KPCA).
LSA represents a word’s meaning as the average of all the passages the word appears in, and a passage, such as a document, as the average of all the words it contains. Mathematically, a singular value decomposition (SVD) of the BOW feature matrix $X \in \mathbb{R}^{D \times L}$ (one row per document, one column per vocabulary term) for the respective corpus is performed. The original data points can then be projected onto the vectors corresponding to the $k$ largest singular values of matrix $X$, yielding a lower-dimensional representation $\tilde{X} \in \mathbb{R}^{D \times k}$, where $k < L$. Choosing a dimensionality $k$ that is smaller than the original dimension $L$ is assumed to lead to a deeper abstraction of words and word sequences and to give a better approximation of their meaning.
KPCA analogously decomposes a kernel matrix computed from the BOW feature vectors to obtain a low dimensional representation of the data, again based on the eigenvectors corresponding to the largest eigenvalues of this matrix. While we have studied different Gaussian kernels, we found that good results could already be obtained using the linear kernel.
When reducing the dimensionality of the BOW feature vectors with LSA and KPCA, four embedding dimensions (100, 250, 500 and 1000) were tested and the best performance on the full texts was achieved using 1000 dimensions. As the dataset subsample contains only 450 patent pairs, here the best results with LSA and KPCA were achieved using only 100 dimensions.
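As a rough illustration of the LSA projection, the following sketch reduces a toy BOW matrix with a truncated SVD. It assumes NumPy is available; the function name `lsa_embed` and the matrix values are made up, and the experiments used far larger matrices and the embedding dimensions listed above.

```python
import numpy as np


def lsa_embed(X, k):
    """Project the rows of X onto the k leading right singular vectors."""
    # X holds one row per document and one column per vocabulary term
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Singular values are returned in descending order, so the first k
    # rows of Vt belong to the k largest singular values
    return X @ Vt[:k].T  # shape: (number of documents, k)


# Toy BOW matrix with 4 documents and 5 terms (values are illustrative)
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])
X_low = lsa_embed(X, k=2)
```

Without centering, KPCA with a linear kernel operates on the Gram matrix of the same vectors, so in that special case its embedding spans the same subspace as this projection.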
Combining BOW features with word2vec embeddings
One shortcoming of the BOW vectors is that semantic relationships between words, such as synonymy, as well as word order, are not taken into account. This is due to the fact that each word is associated with a single dimension in the feature vector and therefore the distances between all words are equal. The aspect of synonymy is especially relevant for patent texts, where very abstract and general terms are used for describing an invention in order to ensure a maximum degree of coverage. For instance, a term like fastener might be preferred over the term screw, as it covers a wider range of material and therefore gives better protection against infringement. Thus, patent texts tend to contain neologisms and abstract words that might even be unique in the corpus. Accounting for this variety in a keyword search is especially tedious and error-prone, as the examiner has to search for synonyms at different levels of abstraction or rely on a thesaurus, which would then need to be kept up to date. Even the BOW approach could in this case only capture the similarity between the patent texts if the words in the context are similar. Neural network language models (NNLM), which aim at representing words or documents by semantically meaningful vectorial embeddings, were specifically developed to overcome these restrictions.
A NNLM that recently received a lot of attention is word2vec. Its purpose is to embed words in a vector space based on their contexts, such that terms appearing in similar contexts are close to each other in the embedding space w.r.t. the cosine similarity [45, 46, 21]. Given a text corpus, the word representations are obtained by training a neural network that learns from the local contexts of the input words in the corpus. The embedding is then given by the learned weight matrix. Mikolov et al. describe two different network architectures for training the word2vec model, namely the continuous bag-of-words (CBOW) and the skip-gram model. The first one learns word representations by predicting a target word based on its context words and the latter one by predicting the context words for the current input word. As the skip-gram model showed better performance in analogy tasks [45, 46, 47], it is used in this paper. (Analogy tasks aim at finding relations such as “A is to B as C is to ___”. For instance, in the relation “good is to better as bad is to ___”, the correct answer would be worse.)
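To make the skip-gram objective more concrete, the sketch below generates the (input word, context word) training pairs that such a model would be asked to predict. It covers only the pair generation, not the network or its training; the function name `skipgram_pairs` and the example sentence are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, context word) pairs within the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context positions: up to `window` words to the left and right
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["a", "fastener", "holds", "the", "plate"], window=1)
```

In the CBOW variant the roles are reversed: all context words of a position together form the input and the center word is the prediction target.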
To make use of the information learned by the word2vec model for each word in the corpus vocabulary of size $L$, the trained word embeddings have to be combined to create a document vector for each patent text. To this end, the dot product of each document’s BOW vector with the word embedding matrix $W \in \mathbb{R}^{L \times d}$, containing one $d$-dimensional word embedding per row, is calculated. For each document represented by a BOW vector $x_k \in \mathbb{R}^L$, this results in a new document vector $\tilde{x}_k = x_k^\top W \in \mathbb{R}^d$, which corresponds to the sum of the word2vec embeddings of the terms occurring in the document, weighted by their respective tf-idf scores. Combining the BOW vectors and the word embeddings thus reduces the dimensionality of the document vectors, while their sparseness is lost.
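The combination of BOW vectors and word embeddings can be sketched as follows. The embedding values, the four-word vocabulary, and the tf-idf weights are invented for illustration; in practice the embedding matrix would come from a trained word2vec model.

```python
import math


def document_vector(bow, embeddings):
    """Weighted sum of word embeddings: BOW vector times embedding matrix."""
    dim = len(embeddings[0])
    doc = [0.0] * dim
    for weight, word_vec in zip(bow, embeddings):
        if weight != 0.0:  # BOW vectors are sparse, skip absent words
            for j in range(dim):
                doc[j] += weight * word_vec[j]
    return doc


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, one row per vocabulary word
W = [
    [0.9, 0.1, 0.0],  # "cable"
    [0.8, 0.2, 0.0],  # "wire"
    [0.0, 0.1, 0.9],  # "needle"
    [0.1, 0.9, 0.1],  # "device"
]
bow_a = [0.7, 0.0, 0.0, 0.3]  # document mentioning "cable" and "device"
bow_b = [0.0, 0.7, 0.0, 0.3]  # document mentioning "wire" and "device"
doc_a = document_vector(bow_a, W)
doc_b = document_vector(bow_b, W)
```

In this toy setting the two BOW vectors share only the ‘device’ dimension, so their cosine similarity is low, whereas the embedded document vectors are nearly parallel because the made-up embeddings for ‘cable’ and ‘wire’ are close to each other.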
With doc2vec, Le and Mikolov extend the word2vec model to directly represent word sequences of arbitrary lengths, such as sentences, paragraphs or even whole documents, by vectors. To learn the representations, word and paragraph vectors are trained simultaneously for predicting the next word for different contexts of fixed size sampled from the paragraph such that, at least in small contexts, word order is taken into account. Words are mapped to a unique embedding in a matrix $W$ and paragraphs to a unique embedding in a matrix $D$. In each training step, paragraph and word embeddings are combined by concatenation to predict the next word given a context sampled from the respective paragraph. After training, the doc2vec model can be used to infer the embedding for an unseen document by performing gradient descent on the document matrix $D$ after having added more rows to it and holding the learned word embeddings and softmax weights fixed.
For the doc2vec model, we explored the parameter values 50, 100, 200 and 500 for the embedding dimension of the document vectors on the cited/random dataset in preliminary experiments, with the best results achieved with . The window size was set to 8, the minimum word count to 5, and the model was trained for 18 iterations. When training the model, the target patents were excluded from the corpus to avoid overfitting. Their document vectors were then inferred by the model given the learned parameters before computing the similarities to the other patents.
A.2 Functions for measuring similarity between text documents
Transforming the patent documents into numeric feature vectors allows their similarity to be assessed with the help of mathematical functions. Rieck and Laskov give a comprehensive overview of vectorial similarity measures for the pairwise comparison of sequential data. These can be divided into three main categories, namely kernels, distance functions, and similarity coefficients. Their formulas are shown in Table 5 and the notation is consistent with the one in the paper. Here, $w$ corresponds to a word in the vocabulary $L$ of the corpus, and $\phi_w(x)$ maps each word to its normalized and weighted count in sequence $x$, i.e. to its tf-idf value. The similarity functions will be briefly described in the following, while further details can be found in the original publication.
The general idea for the comparison of two sequences is that the more overlap they show with respect to their subsequences, the more similar they are. When transforming texts into BOW features, a subsequence corresponds to a single word. Two sequences $x$ and $y$ can thus be compared based on the normalized and weighted counts of the subsequences stored in the respective feature vectors $\phi(x)$ and $\phi(y)$.
The first group of similarity measures Rieck and Laskov discuss are kernel functions. They implicitly map the feature vectors into a possibly high or even infinite dimensional feature space, where the kernel can be expressed as a dot product. A kernel $k$ thus has the general form
$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle,$$
where $\Phi$ maps the vectors into the kernel feature space. The advantage of the kernel function is that it avoids the explicit calculation of the vectors’ high dimensional mapping and allows the result to be obtained in terms of the vectors’ representation in the input space instead [59, 58].
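A small worked example may help to see why the explicit mapping can be avoided. The sketch below uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen purely for illustration (it is not one of the kernels evaluated in the paper), and checks that evaluating the kernel in the input space gives the same value as a dot product in the explicit feature space.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2, evaluated in input space."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def poly2_feature_map(x):
    """Explicit feature map for 2-dimensional inputs matching poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


x, y = [1.0, 2.0], [3.0, 0.5]
# Implicit route: one dot product in the 2-dimensional input space
implicit = poly2_kernel(x, y)
# Explicit route: map both vectors to 3 dimensions, then take the dot product
explicit = sum(a * b for a, b in zip(poly2_feature_map(x), poly2_feature_map(y)))
```

For higher polynomial degrees or a Gaussian kernel the explicit feature space grows very large or becomes infinite dimensional, while the cost of the implicit evaluation stays tied to the input dimension.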
The distance functions described in Rieck and Laskov are so-called bin-to-bin distances. This means that they compare each component of the vector to its corresponding component in the other one, e.g. by subtracting the respective word counts and summing the differences for all words in the vocabulary. Unlike similarity measures, the distance measures are higher the more different the compared sequences are, but they can easily be transformed into a similarity measure, for example by multiplying the result with $-1$.
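As a minimal sketch of a bin-to-bin distance, the following code computes the Manhattan distance, which sums the absolute component-wise differences, and turns it into a similarity by flipping its sign. The negation is an assumed choice of transformation, and the vectors and names are made up for illustration.

```python
def manhattan_distance(x, y):
    """Bin-to-bin distance: sum of absolute component-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def as_similarity(distance):
    """Flip the sign so that larger values mean more similar."""
    return -1.0 * distance


target = [0.5, 0.5, 0.0]
candidates = {"near": [0.5, 0.25, 0.25], "far": [0.0, 0.0, 1.0]}

# Score every candidate against the target and rank by similarity
scores = {
    name: as_similarity(manhattan_distance(target, vec))
    for name, vec in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
```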
Similarity coefficients were designed for the comparison of binary vectors and, instead of expressing metric properties, they assess similarity by comparing the number of matching components between two sequences. More precisely, for calculating the similarity of two sequences $x$ and $y$, they use three variables a, b and c, where a corresponds to the number of components contained in both $x$ and $y$, b to the number of components contained in $x$ but not in $y$, and c to the number of components contained in $y$ but not in $x$. In the case of BOW vectors, which are not inherently binary, the three variables can be expressed as follows:
$$a = \sum_{w \in L} \min\big(\phi_w(x), \phi_w(y)\big),$$
$$b = \sum_{w \in L} \Big[\phi_w(x) - \min\big(\phi_w(x), \phi_w(y)\big)\Big],$$
$$c = \sum_{w \in L} \Big[\phi_w(y) - \min\big(\phi_w(x), \phi_w(y)\big)\Big].$$
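The three variables can be illustrated with a short sketch. It assumes the common generalization to non-binary vectors in which the number of matches is the component-wise minimum, and it uses the Jaccard coefficient a / (a + b + c) as one example of a similarity coefficient; whether this particular coefficient was among those evaluated is not stated here, and the function names are illustrative.

```python
def match_counts(x, y):
    """Return (a, b, c): shared mass, mass only in x, mass only in y."""
    a = sum(min(p, q) for p, q in zip(x, y))
    b = sum(p - min(p, q) for p, q in zip(x, y))
    c = sum(q - min(p, q) for p, q in zip(x, y))
    return a, b, c


def jaccard(x, y):
    """Jaccard coefficient a / (a + b + c); 0.0 if both vectors are empty."""
    a, b, c = match_counts(x, y)
    total = a + b + c
    return a / total if total > 0 else 0.0


# Binary case: reduces to counting shared and unshared components
binary = match_counts([1, 1, 0, 1], [1, 0, 1, 1])
# Weighted case: counts are replaced by component-wise minima
weighted = match_counts([2, 0, 1], [1, 1, 1])
```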
Appendix B Supporting Information: Data
To evaluate the different methods for computing document similarities on real world data, an initial patent corpus was obtained from a patent database. This corpus consists of over 100,000 patent grants and applications published by the United States Patent and Trademark Office (USPTO) between 2000 and 2015.
We created this patent corpus by crawling Google Patents (https://www.google.de/patents), as illustrated in Fig 4. To get a more homogeneous dataset, only patents of the category A61 (medical or veterinary science and hygiene) according to the Cooperative Patent Classification scheme (CPC) were included in our corpus. Another important criterion for including a patent document in our initial patent corpus was that its search report, i.e. the prior art cited by the examiner, had to be available from the database. Starting with 20 manually selected seed patents published in 2015, the patent corpus was iteratively extended by including the seed patents’ citations if they were published after 1999 and belonged to the category A61. The citations of these patents were then again checked for publication year and category and included if they fulfilled the respective conditions.
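The iterative extension of the corpus can be thought of as a breadth-first traversal of the citation graph with a filter on publication year and category. The sketch below runs on a small in-memory dictionary; the record fields (`year`, `cpc`, `citations`), the patent identifiers, and the function name `crawl_corpus` are hypothetical stand-ins, since the actual crawler and data format are not described in the text.

```python
from collections import deque


def crawl_corpus(seeds, lookup, min_year=2000, category="A61"):
    """Breadth-first expansion of a seed set along citation links."""
    corpus = set()
    queue = deque(seeds)
    while queue:
        pid = queue.popleft()
        # Skip patents already included or without an available record
        if pid in corpus or pid not in lookup:
            continue
        record = lookup[pid]
        # Keep only patents published after 1999 in the chosen category
        if record["year"] < min_year or not record["cpc"].startswith(category):
            continue
        corpus.add(pid)
        # The citations of an included patent are checked in later rounds
        queue.extend(record["citations"])
    return corpus


# Hypothetical toy database standing in for the crawled patent records
patents = {
    "P1": {"year": 2015, "cpc": "A61B", "citations": ["P2", "P3", "P4", "P5"]},
    "P2": {"year": 2003, "cpc": "A61K", "citations": ["P5"]},
    "P3": {"year": 1998, "cpc": "A61B", "citations": []},  # published too early
    "P4": {"year": 2007, "cpc": "G06F", "citations": []},  # wrong category
    "P5": {"year": 2001, "cpc": "A61M", "citations": ["P3"]},
}
corpus = crawl_corpus(["P1"], patents)
```

Because every included patent is older than the patent citing it, a traversal of this kind naturally accumulates documents from earlier years.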
Structure of the crawled dataset
Comparing the distribution of patents published per year in the dataset with the total number of patents filed at the USPTO between 2000 and 2015 (Fig 5) shows that the distribution in the dataset is not representative. The peak in 2003 and the steadily decreasing number of patents with a publication date in the following years are most probably a result of the crawling strategy: since we started with patents filed in 2015 and then successively crawled their citations, which were necessarily published earlier, few patents from more recent years ended up in the dataset.
The same holds for the subcategory distribution displayed in Fig 6. While the most prominent subcategory in our dataset is A61B, the most frequent subcategory is actually A61K. The bias for subcategory A61B is due to the fact that several seed patents belonged to it.
|Subcategory|Description|
|---|---|
|A61C|Apparatus or methods for oral or dental hygiene|
|A61D|Veterinary instruments, implements, tools, or methods|
|A61F|Filters implantable into blood vessels; Devices providing patency to or preventing collapsing of tubular structures of the body, e.g. stents; Orthopaedic, nursing or contraceptive devices; Treatment or protection of eyes or ears; Bandages, dressings or absorbent pads|
|A61G|Transport or accommodation for patients; Operating tables or chairs; Chairs for dentistry|
|A61H|Physical therapy apparatus, e.g. devices for locating or stimulating reflex points in the body; Bathing devices for special therapeutic or hygienic purposes or specific parts of the body|
|A61J|Containers specially adapted for medical or pharmaceutical purposes; Devices or methods specially adapted for bringing pharmaceutical products into particular physical or administering forms; Devices for administering food or medicines orally; Devices for receiving spittle|
|A61K|Preparations for medical, dental, or toilet purposes|
|A61L|Methods or apparatus for sterilising materials or objects in general; Disinfection, sterilisation, or deodorisation of air; Chemical aspects of bandages, dressings, absorbent pads, or surgical articles; Materials for bandages, dressings, absorbent pads, or surgical articles|
|A61M|Devices for introducing media into, or onto, the body; Devices for transducing body media or for taking media from the body; Devices for producing or ending sleep or stupor|
|A61Q|Specific use of cosmetics or similar toilet preparations|
Finally, to gain some insight into the existing search for prior art, we examine the distribution of the number of citations in the patent dataset. The citation counts for a subsample of 5,000 randomly selected patents show that the distribution follows Zipf’s law, with many patents having very few citations and few patents having many citations (Fig 7).
Structure of a patent
The requirements regarding the structure of a patent application are very strict and prescribe the presence of certain sections and what their content should be. For the automated comparison of texts it can be interesting to have a closer look at the different sections of the documents, as it might, for instance, be sufficient to compare only a specific section of the texts. This is useful in two ways: it allows a preliminary search for prior art before the patent text is written in its entirety, which prevents unnecessary work, and it decreases the computational burden of preprocessing and comparing full texts.
The Patent Cooperation Treaty (PCT) by the World Intellectual Property Organization (WIPO), an agency of the United Nations with the aim of unifying and fostering the protection of intellectual property, defines several obligatory sections a patent application must contain. According to its requirements, a patent application should consist of a title, an abstract, the claims, and the description, where the invention is thoroughly described and the figures included in the document are explained in depth. Similar to scientific publications, a patent’s abstract consists of a short summary of what the invention is about. The claims section plays a very special role in a patent application, as it defines the extent of the protection the patent should guarantee for the invention and is therefore the section the patent attorneys and patent officers base their search for prior art on. If the claims conflict with already existing publications, they can be edited by weakening the protection requirements, which is why this section is reformulated the most during the possibly multiple stages of a patent process.
As both the USPTO and the European Patent Office (EPO) adopt the PCT, the required sections are the same in the United States and in Europe. Nonetheless, some differences in the length of the description section can be observed. For a patent application handed in at the USPTO, this section mostly consists of the figures’ descriptions, while for applications to the EPO it contains more abstract descriptions of the invention itself. This is due to stricter requirements of consistency between claims and description for European patents and must be taken into account when patents filed at different offices are compared, as this might result in lower similarity scores [53, 33].
Constructing a labelled dataset with cited and random patents
A first labelled dataset was constructed from the patent corpus by pairing up the patents and labelling each pair depending on whether or not one patent in the pair is cited by the other. More formally, let $P$ be the set of patents in the corpus and $P \times P$ its Cartesian product. Each patent pair $(p_1, p_2) \in P \times P$ is then labelled as cited if $p_2$ is contained in the search report of patent $p_1$, and as random otherwise. As some of the tested approaches are computationally expensive, we did not pair up all of the 100,000 documents in the corpus. Instead, the roughly 2,500 patents published in 2015 contained in the corpus were selected as a set of target patents and paired up with their respective citations as well as with a set of 1,000 randomly selected patents that were not contained in the search reports of any of the target patents.
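The pairing procedure can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original code: the name `build_pairs`, the dictionary `search_reports` (target id to cited ids) and the fixed random seed are hypothetical, and excluding the target patents themselves from the random pool is an additional assumption.

```python
import random


def build_pairs(targets, search_reports, corpus_ids, n_random=1000, seed=0):
    """Return (target, other, label) triples labelled 'cited' or 'random'.

    search_reports maps each target id to the ids cited by the examiner.
    One shared set of random partners is drawn from patents that appear
    in none of the targets' search reports.
    """
    cited_by_any = set()
    for target in targets:
        cited_by_any.update(search_reports.get(target, ()))
    # Assumption: target patents are not used as random partners.
    pool = sorted(set(corpus_ids) - cited_by_any - set(targets))
    rng = random.Random(seed)
    random_partners = rng.sample(pool, min(n_random, len(pool)))

    pairs = []
    for target in targets:
        for cited in search_reports.get(target, ()):
            pairs.append((target, cited, "cited"))
        for partner in random_partners:
            pairs.append((target, partner, "random"))
    return pairs
```

With one shared set of 1,000 random partners, every target patent contributes exactly 1,000 random pairs, so the number of random pairs grows linearly with the number of targets.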
Due to divisional applications and parallel filings, and because claims are often changed during the application process, patents with the same description may appear several times with different IDs. As a sanity check, duplicates for some of the target patents were therefore included in the dataset as well (duplicates are expected to receive a similarity score near or equal to 1). Altogether, this ‘cited/random’ labelled dataset consists of 2,470,736 patent pairs, of which 41,762 have a citation, 2,427,000 a random, and 1,974 a duplicate relation.
Obtaining relevancy labels from a patent attorney
As a subsample of the first dataset, our second dataset was constructed by taking ten of the target patents published in 2015, as well as their respective cited patents. In addition to that, in order to assess if relevant patents were missing from the search report, some of the random patents were included as well. These were selected based on their cosine similarity to the target patent, computed using the BOW vector representations. We chose for each patent the ten highest-scored, ten very low-ranked, and ten mid-ranked random patents. In total, this dataset subsample consists of 450 patent pairs, of which 151 are citations and 299 random pairs.
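A possible way to pick the high-, mid- and low-ranked random patents is sketched below. How exactly the mid- and low-ranked patents were chosen is not specified, so the sketch assumes a window around the median rank and the bottom of the ranking, respectively; the names `cosine` and `pick_random_candidates` are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def pick_random_candidates(target_vec, random_vecs, k=10):
    """Return (top, mid, low) lists of k patent ids each.

    random_vecs maps a patent id to its BOW vector. Patents are ranked by
    cosine similarity to the target; 'mid' is an assumed window around the
    median rank and 'low' the assumed bottom of the ranking. The three
    lists overlap if fewer than 3 * k patents are supplied.
    """
    ranked = sorted(
        random_vecs,
        key=lambda pid: cosine(target_vec, random_vecs[pid]),
        reverse=True,
    )
    mid_start = max(0, (len(ranked) - k) // 2)
    return ranked[:k], ranked[mid_start:mid_start + k], ranked[-k:]
```

Including low- and mid-ranked patents alongside the highest-scored ones gives the rater a sample that spans the whole similarity range rather than only the likely hits.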
Neither knowing the similarity score of the patent pairs nor which ones were cited or random patents, the patent attorney manually assigned a score between 0 and 5 to the patent pairs according to how relevant the respective document was considered for the target patent, thus yielding the second labelled dataset. For most of the following evaluation, the patent attorney’s scoring was transformed into a binary labelling by considering all patent pairs with a score greater than as relevant and the others as irrelevant.
Appendix C Supporting Information: Evaluation
Computing AUC scores to evaluate similarity measures
Both the positive and the negative samples (i.e. pairs of patents) are associated with a distribution of similarity scores. Ideally, these two distributions of scores would be separated, such that it is easy to choose a threshold to identify a positive or negative sample based on the corresponding similarity score of the patent pair (Fig 8). To measure how well these two distributions are separated, we can compute the area under the receiver operating characteristic (ROC) curve. Every possible threshold value chosen for separating positive from negative examples can lead to some pairs of unrelated patents being mistakenly considered as relevant, which are called false positives (FP), or to pairs of related patents being mistakenly regarded as irrelevant, so-called false negatives (FN). Correct decisions are either true negatives (TN), i.e., a pair of random patents that was correctly considered as irrelevant, or true positives (TP), which are correctly detected cited patents. Based on this, for every threshold value we can compute the true positive rate (TPR), also called recall, the false positive rate (FPR), and the false negative rate (FNR) to set wrong and correct decisions into relation:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{FNR} = \frac{FN}{TP + FN} = 1 - \mathrm{TPR}$$
By plotting the TPR against the FPR of a binary classifier for different decision thresholds, we then obtain the graph of the ROC curve, where the area under the ROC curve (AUC) conveniently translates the performance of the classifier into a number between 0.5 (no separation between distributions) and 1 (clear distinction between positive and negative samples), as shown in Fig 8. (Many information retrieval applications use precision and recall to measure the system’s performance by comparing the number of relevant documents to the number of retrieved documents. However, since we do not only want to retrieve relevant documents, but in general select a discriminatory, interpretable, and meaningful similarity score, we consider the AUC, which relates the system’s recall to its FPR.)
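The rates and the AUC can be computed directly from scored, labelled pairs. The sketch below is a minimal pure-Python illustration, not the evaluation code used for the reported results; the function names `rates` and `auc` are hypothetical. It uses the standard definitions TPR = TP/(TP+FN), FPR = FP/(FP+TN) and FNR = FN/(TP+FN), and it computes the AUC through the equivalent rank formulation: the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half.

```python
def rates(scores, labels, threshold):
    """Return (TPR, FPR, FNR) when scores >= threshold count as positive.

    labels are truthy for positive (e.g. cited) pairs. Assumes that both
    classes occur at least once.
    """
    tp = sum(1 for s, l in zip(scores, labels) if l and s >= threshold)
    fn = sum(1 for s, l in zip(scores, labels) if l and s < threshold)
    fp = sum(1 for s, l in zip(scores, labels) if not l and s >= threshold)
    tn = sum(1 for s, l in zip(scores, labels) if not l and s < threshold)
    return tp / (tp + fn), fp / (fp + tn), fn / (tp + fn)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Quadratic in the number of samples, so suitable only for small
    examples. Assumes that both classes occur at least once.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For millions of pairs, a sort-based implementation or an established library routine would be the practical choice; the pairwise version is shown only because it makes the interpretation of the AUC explicit.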
Appendix D Supporting Information: Results
D.1 Identifying cited patents using different similarity functions with BOW features
We evaluated all similarity measures listed in Table 5 using BOW features on the cited/random corpus. When computing the BOW features, we either used the term frequency (tf) or a binary flag for each word occurring in a document and experimented with raw values as well as values weighted by the words’ idf scores. Furthermore, these feature vectors were either normalized by the vector’s maximum value or its length. The AUC scores for all these combinations can be found in Table 7.
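The feature variants described above (tf or binary, with or without idf weighting, normalized by maximum or length) can be illustrated with a small sketch. The function name `bow_features` and its parameters are hypothetical, and the idf variant log(N/df) is an assumption, since the exact idf formula is not stated here.

```python
import math
from collections import Counter


def bow_features(docs, binary=False, use_idf=True, norm="length"):
    """Return one {word: weight} dict per document.

    docs is a list of token lists. norm is 'length' (Euclidean norm) or
    'max' (largest entry). All-zero vectors are returned unnormalized.
    """
    if norm not in ("length", "max"):
        raise ValueError("norm must be 'length' or 'max'")
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    features = []
    for doc in docs:
        counts = Counter(doc)
        vec = {w: (1.0 if binary else float(c)) for w, c in counts.items()}
        if use_idf:
            # Assumed idf variant: log(N / df).
            vec = {w: v * math.log(n_docs / df[w]) for w, v in vec.items()}
        if norm == "length":
            scale = math.sqrt(sum(v * v for v in vec.values()))
        else:
            scale = max(vec.values(), default=0.0)
        if scale:
            vec = {w: v / scale for w, v in vec.items()}
        features.append(vec)
    return features
```

With this idf variant, a word that occurs in every document receives weight zero, which is one way in which the weighting reduces the influence of stop words.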
The linear kernel with length normalized vectors corresponds to the cosine similarity.
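The AUC of the Euclidean distance is equal to that of the cosine similarity, as for length normalized vectors (i.e. $\|x\| = \|y\| = 1$), we get $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,x^\top y = 2 - 2\,x^\top y$, and $x^\top y$ is equal to the cosine similarity.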
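For unit-length vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so both measures rank pairs identically (in opposite order) and therefore yield the same AUC. The short check below verifies this identity numerically; it assumes plain Python lists as vectors, and the helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine_similarity(x, y):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def squared_euclidean(x, y):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


x, y = [3.0, 0.0, 4.0], [1.0, 2.0, 2.0]
lhs = squared_euclidean(normalize(x), normalize(y))
rhs = 2.0 - 2.0 * cosine_similarity(x, y)
```

Here `lhs` and `rhs` agree up to floating-point error, since the distance is a strictly decreasing function of the cosine similarity once the vectors are length normalized.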
For all similarity functions (excluding the Minkowski distance), the best result is obtained when using either tf (distance functions) or tf-idf (kernel functions, similarity coefficients, as well as Canberra and Euclidean distance) feature vectors. This shows that it is important to consider how often each term occurs in the documents instead of only encoding its presence or absence. Another observation is that the majority of the highest AUC scores are obtained on the tf-idf feature vectors, which give a more accurate picture of how important each term actually is for the given document and reduce the importance of stop words. Except for the Chebyshev distance, the final normalization of the vectors should be performed using their lengths and not their maximum values. This might be because the length normalization takes all vector entries into account and not only the highest one, which makes it less sensitive to outliers, i.e. extremely high values in the vector. With length-normalized vectors as input, the linear kernel is equal to the cosine similarity and can thus be included in the group of similarity coefficients.
All in all, except for the Euclidean distance, which gives the same AUC as the cosine similarity when using length-normalized vectors, the kernel functions and similarity coefficients yield much better results than the distance measures, which shows that it is more important to focus on words the texts have in common than to calculate their distance in the vector space. Among similarity coefficients and kernel functions, the former function class gives slightly more robust results. Given that similarity coefficients are specifically designed for sequence comparison by explicitly taking into account their subsequences’ overlap, they seem to be the appropriate function class for measuring similarity between the BOW feature vectors.
The cosine similarity is widely used in information retrieval [13, 22, 8] and is well suited to distinguish between cited and random patents as it assigns lower scores to random than to cited patent pairs and, additionally, reliably detects duplicates by assigning them a score near or equal to 1 (Fig 2 in the main paper).
D.2 Detailed examination of outliers in the citation process
For a better understanding of the disagreements between the cited/random labelling and the cosine similarity scores compared to the relevant/irrelevant labelling, we take a closer look at an FP yielded by the cosine similarity as well as an FP yielded by both the cosine similarity and the cited/random labelling. In addition, the main text gives an example of an FN, i.e. a relevant patent that was missed by the patent examiner but would have been found by our automated approach, as it received a high similarity score.
False positive yielded by our automated approach
The patent with ID US7585299 (http://www.google.de/patents/US7585299), marked with a gray circle in Fig 9 on the left, corresponds to an FP when both human labellings are taken as the ground truth, because it received a high cosine similarity score despite being neither relevant nor a citation.
The target patent (ID US20150066086, http://www.google.de/patents/US20150066086) as well as the patent with ID US7585299 describe inventions that stabilize vertebrae. In the target patent, the described device clamps adjacent spinous processes together by two plates held together by two screws without introducing screws inside the bones. The device described in patent US7585299, in contrast, stabilizes the spine using bone anchors, which are screwed e.g. into the spinous processes or another part of the respective vertebrae and which have a clamp on the opposite end. The vocabulary in both patents is thus extremely similar, which leads to a high overlap on the BOW vector level. However, the two devices are far too different to be considered similar inventions, given that one is rigid and screwed into the bones whereas the other only clamps the spinous processes and thereby guarantees a certain degree of flexibility.
False positive yielded by our automated approach and the cited/random labelling
For other target patents, more discordance with respect to the relevance of the other patents can be observed, also between the two human ratings. The correlation of the relevant/irrelevant scoring for the patent with ID US20150066087 (http://www.google.de/patents/US20150066087) in Fig 9 on the right shows that there are many cited patents that received a rather low score from the patent attorney, which means that the patent examiner produced a considerable number of FPs. One possible explanation is that patent examiners tend to produce more rather than fewer citations and thus include in the search report a large share of the patents returned for their keyword query, although, on closer inspection, these patents turn out not to be relevant for the target patent. This is also due to the fact that they mostly base their search on the claims section, which is usually kept as general as possible to guarantee a maximum degree of protection for the invention. The analysis of the FP with ID US20130079880 (http://www.google.de/patents/US20130079880), marked by the gray circle in the plot, supports this hypothesis. The claims sections of the two patents are similar and the devices described in the patents are of similar construction, both having plates referred to as wings. The device described in the target patent, however, is designed to immobilize adjacent spinous processes, whereas the one described in patent US20130079880 is aimed at increasing the space between two adjacent vertebrae to relieve pressure caused, for instance, by dislocated discs. The similar claims sections in particular might have led the patent examiner to cite the patent, although the devices clearly have different purposes, which can easily be derived from their descriptions.