Classifying Patent Applications with Ensemble Methods

We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications. The goal of the task is to use computational methods to categorize patent applications according to a coarse-grained taxonomy of eight classes based on the International Patent Classification (IPC). We tested a variety of approaches for this task and the best results, 0.778 micro-averaged F1-Score, were achieved by SVM ensembles using a combination of words and characters as features. Our team, BMZ, was ranked first among 14 teams in the competition.




1 Introduction

According to statistics of the World Intellectual Property Organization (WIPO), the number of patent applications filed across the world keeps growing every year. To cope with the large volume of applications, companies and organizations have been investing in the development of software to process, store, and categorize patent applications with minimum human intervention.

An important part of patent application forms is, of course, composed of text. This has led to the widespread use of NLP methods in patent application processing systems, as evidenced in Section 2. One such example is the use of text classification methods to categorize patent applications according to standardized taxonomies such as the International Patent Classification (IPC), as discussed in the studies by Benzineb and Guyot (2011) and Fall et al. (2003).

In this paper, we present a system to automatically categorize patent applications from Australia according to the top sections of the IPC taxonomy using a dataset provided by the organizers of the ALTA 2018 shared task on Classifying Patent Applications Molla and Seneviratne (2018). The dataset and the taxonomy are presented in more detail in Section 3. Building on our previous work Malmasi et al. (2016a); Malmasi and Zampieri (2017), our system is based on SVM ensembles and it achieved the highest performance of the competition.

2 Related Work

There have been a number of studies applying NLP and Information Retrieval (IR) methods to patent applications specifically, and to legal texts in general, published in the last few years.

Applications of NLP and IR to legal texts include the use of text summarization methods Farzindar and Lapalme (2004) to summarize legal documents and, most recently, court ruling prediction. A few papers have been published on this topic, such as the one by Katz et al. (2014), which reported 70% accuracy in predicting decisions of the US Supreme Court; Aletras et al. (2016) and Medvedeva et al. (2018), which explored computational methods to predict decisions of the European Court of Human Rights (ECHR); and Sulea et al. (2017a, b) on predicting the decisions of the French Supreme Court. In addition to the aforementioned studies, one recent shared task has been organized on court ruling prediction Zhong et al. (2018).

Regarding the classification of patent applications, the task described in this paper, the related WIPO-alpha dataset is often used in such studies, and we also use it in our experiments. WIPO-alpha contains thousands of patents (the number grows every year) and is usually used in its hierarchical form Tikk and Biró (2003). Recently, word embeddings and LSTMs were applied to the task Grawe et al. (2017); those experiments took the hierarchy into account, but only superficially.

Hofmann et al. investigated the hierarchical problem of WIPO-alpha in depth with SVMs Hofmann et al. (2003); Tsochantaridis et al. (2004); Cai and Hofmann (2007). They showed that a hierarchical approach produced better results. Many studies have shown that evaluating a hierarchical classification task is not trivial and that many measures can integrate the class ontology. Still, using multiple hierarchical measures can introduce bias Brucker et al. (2011). The text classification field has nevertheless improved considerably in the last 3-4 years. This is one reason why, when revisiting the WIPO-alpha dataset, investigating only the top nodes of the WIPO class ontology might be a good starting point for subsequent tasks.

Finally, at the intersection between patent applications and legal texts in general, Wongchaisuwat et al. (2016) presented experiments on predicting patent litigation and time to litigation.

3 Data

The dataset released by the organizers of the ALTA 2018 shared task consists of a collection of Australian patent applications. The dataset contains 5,000 documents released for training and 1,000 documents for testing. The classes relevant for the task consisted of eight different main branches of the WIPO class ontology as follows:

  • A: Human necessities;

  • B: Performing operations, transporting;

  • C: Chemistry, metallurgy;

  • D: Textiles, paper;

  • E: Fixed constructions;

  • F: Mechanical engineering, lighting, heating, weapons, blasting;

  • G: Physics;

  • H: Electricity.

The documents were produced by automated OCR and were therefore not thoroughly cleaned before release. For example, some documents contained expressions such as “NAnparse failure” and page numbers in the middle of paragraphs, which made processing more challenging. We enhanced the dataset with data from the WIPO-alpha repository gathered in October 2018, consisting of 46,319 training documents and 28,924 test documents. We also took a random sub-sample of 100,000 documents from the WIPO-en gamma English dataset, which contains 1.1 million patent documents in total.

We used all of the available text fields in each document and concatenated them into a single text.

4 Methodology

4.1 Preprocessing

The documents come from different sources and authors; therefore, no standard representation exists and there is high variation in formatting across the documents. Since we do not utilize document structure in our approach, we decided to eliminate it by collapsing each document into a single block of text. This was done by replacing all consecutive non-alphanumeric characters with a single space. Next, we converted the text to lowercase and removed any tokens representing numbers.
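The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the submission: the function name `preprocess` is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and a "token representing a number" is assumed to be a token made up only of digits.

```python
import re


def preprocess(text: str) -> str:
    """Collapse a patent document into one lowercase block of text."""
    # Replace every run of non-alphanumeric characters (punctuation,
    # newlines, OCR debris) with a single space.
    # Assumption: only ASCII letters and digits count as alphanumeric.
    text = re.sub(r"[^0-9A-Za-z]+", " ", text)
    # Lowercase, split on whitespace, and drop purely numeric tokens.
    tokens = [tok for tok in text.lower().split() if not tok.isdigit()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("Claim 1:\n\nA   method, comprising 25 steps (see p. 3)."))
```

Because punctuation is replaced before numbers are filtered, a value such as "3.5" is split into two digit tokens and both are removed.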

System                                   Training   Public (Validation)   Private (Test)
(1) Baseline 20k feats.                  0.709      0.710                 0.692
(2) Baseline 40k feats.                  0.715      -                     -
(3) Baseline w/ WIPO-alpha               0.775      0.758                 0.744
(4) Semi-supervised                      0.734      0.728                 0.704
(5) Ensemble w/ WIPO-alpha + gamma       0.787      0.776                 0.778
Table 1: F1-micro performance of the systems in training (10-fold CV), in the validation and in the test sets (train, public and private leaderboard).

4.2 Features

For feature extraction we used and extended the methods reported in Malmasi and Zampieri (2017). We used the Term Frequency (TF) of n-grams, with n ranging from 3 to 6 for characters and from 1 to 2 for words. In addition to term frequency, we applied inverse document frequency weighting (TF-IDF) Gebre et al. (2013), which resulted in the best single feature set for prediction.
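The feature extraction described above can be illustrated with a small standard-library sketch. The helper names (`char_ngrams`, `word_ngrams`, `tfidf`) are hypothetical, and the weighting shown, raw term frequency multiplied by an unsmoothed log(N / df), is only one common TF-IDF variant; the exact variant used by the system is not stated here.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=3, n_max=6):
    """All character n-grams of `text` for n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_min=1, n_max=2):
    """All word n-grams of whitespace-tokenized `text`."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, extract):
    """Return one {term: weight} dict per document.

    Assumption: weight = raw term frequency * log(N / document frequency),
    with no smoothing or length normalization.
    """
    counts = [Counter(extract(doc)) for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    return [{term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
            for c in counts]


if __name__ == "__main__":
    docs = ["wind turbine blade", "wind powered pump"]
    for weights in tfidf(docs, word_ngrams):
        print(weights)
```

With this variant, a term that occurs in every document (here "wind") receives weight zero, which is the intended effect of IDF: frequent but uninformative terms contribute little to the representation.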

4.3 Classifier

We used an ensemble-based classifier for this task. Our base classifiers are linear Support Vector Machines (SVM). SVMs have proven to deliver very good performance in a number of text classification problems. They were previously used for complex word identification Malmasi et al. (2016a), triage of forum posts Malmasi et al. (2016b), dialect identification Malmasi and Zampieri (2017), hate speech detection Malmasi and Zampieri (2018), and court ruling prediction Sulea et al. (2017a).

4.4 Systems

We developed a number of different systems. As baselines we employed single SVM models with TF-IDF, using the 20k and 40k most frequent words as features, resulting in two models. We created a third baseline which included the WIPO-alpha data for training.

For system 4, we augmented system 3 with a semi-supervised learning approach similar to the submission by Jauhiainen et al. (2018) to the dialect identification tasks at the VarDial workshop Zampieri et al. (2018). This approach consists of classifying the unlabelled test set with a model based on the training data, then selecting the predictions with the highest confidence and using them as new additional training samples. This approach can be very useful if there are few training samples and out-of-domain data is expected.
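A single round of this semi-supervised procedure can be sketched as follows. The sketch is deliberately classifier-agnostic and makes several assumptions that are not taken from the system description: `fit` is a hypothetical training function that returns a callable model, the model returns a `(label, confidence)` pair for one sample, and "highest confidence" is implemented as a fixed threshold rather than, for example, a top-k selection.

```python
def self_train(fit, train_x, train_y, unlabeled_x, threshold=0.9):
    """One round of self-training.

    fit(xs, ys) must return a callable `model`, and model(x) must return
    a (label, confidence) pair. Both interfaces are illustrative.
    Returns the retrained model and the number of pseudo-labeled samples.
    """
    # 1. Train an initial model on the labeled data only.
    model = fit(train_x, train_y)

    # 2. Label the unlabeled set and keep only confident predictions.
    new_x, new_y = [], []
    for x in unlabeled_x:
        label, confidence = model(x)
        if confidence >= threshold:  # assumption: fixed cut-off
            new_x.append(x)
            new_y.append(label)

    # 3. Retrain on the labeled data plus the pseudo-labeled samples.
    retrained = fit(train_x + new_x, train_y + new_y)
    return retrained, len(new_x)
```

With an SVM base classifier, one plausible confidence score is the margin between the two highest decision values, but that choice is also an assumption.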

Finally, for system 5, we extended system 4 to be an ensemble of both word- and character-based models, and to include additional training data from the WIPO-alpha and WIPO-en gamma datasets, as described in Section 3.
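To make the ensemble idea concrete, the sketch below fuses the label predictions of several base classifiers by plurality voting. The fusion rule is an assumption chosen for illustration, since the combination method is not specified here, and the function name `plurality_vote` is hypothetical.

```python
from collections import Counter


def plurality_vote(predictions):
    """Fuse label predictions from several classifiers.

    `predictions` is a list with one label sequence per classifier, all
    of the same length. The most frequent label wins for each document.
    Ties go to the label predicted by the classifier listed first,
    because Counter keeps first-seen order among equal counts.
    """
    fused = []
    for votes in zip(*predictions):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


if __name__ == "__main__":
    word_model = ["A", "G", "H"]    # e.g. word n-gram model
    char_model = ["A", "H", "H"]    # e.g. character n-gram model
    third_model = ["B", "G", "H"]   # any further base classifier
    print(plurality_vote([word_model, char_model, third_model]))
```

Voting only needs hard labels from each base classifier. Averaging per-class scores is a common alternative when the base classifiers expose comparable confidence values.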

5 Results

In this section, we investigate the impact of the different systems and data. We pay special attention to the competition results and report them in different settings. This is particularly interesting since both the amount of data added with WIPO-alpha and the vocabulary of the ALTA data without pre-processing were relatively large.

5.1 Official Results

We present the results obtained in the training stage, the public leaderboard, and the private leaderboard in Table 1. The shared task was organized using Kaggle, a data science platform, in which the terms Public Leaderboard and Private Leaderboard refer to what is commonly understood as the development or validation phase and the test phase. This is important in the system development stage as it helps prevent systems from overfitting. We used 10-fold cross validation in the training setup.
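For readers unfamiliar with the evaluation metric, micro-averaged F1 pools true positives, false positives, and false negatives over all classes before computing precision and recall. The sketch below is a generic illustration and not the official scoring script; the function name `micro_f1` is hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label classification."""
    labels = set(gold) | set(predicted)
    tp = fp = fn = 0
    # Pool the counts over all classes before computing the scores.
    for label in labels:
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1
            elif p == label and g != label:
                fp += 1
            elif p != label and g == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["A", "B", "G", "H"]
    pred = ["A", "B", "H", "H"]
    print(micro_f1(gold, pred))
```

When every document receives exactly one label, each error counts once as a false positive and once as a false negative, so micro-averaged precision, recall, and F1 all coincide with accuracy.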

As can be seen in Table 1, the ensemble system with additional data achieved the best performance. This can be attributed to the use of large amounts of additional training data, a semi-supervised approach, and an ensemble model with many features.

6 Conclusion and Future Work

This paper presented an approach to categorizing patent applications into eight classes of the WIPO class taxonomy. Our system competed in the ALTA 2018 - Classifying Patent Applications shared task under the team name BMZ. Our best system is based on an ensemble of SVM classifiers trained on words and characters. It achieved 0.778 micro-averaged F1-Score and ranked first in the competition among 14 teams.

We observed that expanding the training data using the WIPO datasets brought substantial performance improvement. This dataset is similar to that provided by the shared task organizers in terms of genre and topics and it contains 15 times more samples. The use of an ensemble-based approach prevented the system from overfitting and provided more robust predictions.

In future work we would like to use hierarchical approaches to classify patent applications using a more fine-grained taxonomy. Finally, we would also like to investigate the performance of deep learning methods for this task.


Acknowledgments

We would like to thank the ALTA 2018 shared task organizers for organizing this interesting shared task and for replying promptly to our inquiries.


References

  • Aletras et al. (2016) Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro, and Vasileios Lampos. 2016. Predicting Judicial Decisions of the European Court of Human Rights: A Natural Language Processing Perspective. PeerJ Computer Science 2:e93.
  • Benzineb and Guyot (2011) Karim Benzineb and Jacques Guyot. 2011. Automated Patent Classification. In Current challenges in patent information retrieval, Springer, pages 239–261.
  • Brucker et al. (2011) Florian Brucker, Fernando Benites, and Elena P. Sapozhnikova. 2011. An Empirical Comparison of Flat and Hierarchical Performance Measures for Multi-Label Classification with Hierarchy Extraction. In Proceedings of KES Part I.
  • Cai and Hofmann (2007) Lijuan Cai and Thomas Hofmann. 2007. Exploiting known taxonomies in learning overlapping concepts. In IJCAI, volume 7, pages 708–713.
  • Fall et al. (2003) Caspar J Fall, Atilla Törcsvári, Karim Benzineb, and Gabor Karetka. 2003. Automated Categorization in the International Patent Classification. In ACM SIGIR Forum. ACM, volume 37, pages 10–25.
  • Farzindar and Lapalme (2004) Atefeh Farzindar and Guy Lapalme. 2004. Legal Text Summarization by Exploration of the Thematic Structures and Argumentative Roles. In Proceedings of the Text Summarization Branches Out Workshop.
  • Gebre et al. (2013) Binyam Gebrekidan Gebre, Marcos Zampieri, Peter Wittenburg, and Tom Heskes. 2013. Improving Native Language Identification with TF-IDF Weighting. In Proceedings of the BEA Workshop.
  • Grawe et al. (2017) M. F. Grawe, C. A. Martins, and A. G. Bonfante. 2017. Automated Patent Classification Using Word Embedding. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 408–411.
  • Hofmann et al. (2003) Thomas Hofmann, Lijuan Cai, and Massimiliano Ciaramita. 2003. Learning with taxonomies: Classifying documents and words. In NIPS workshop on syntax, semantics, and statistics.
  • Jauhiainen et al. (2018) Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2018. HeLI-based experiments in Swiss German dialect identification. In Proceedings of the VarDial Workshop.
  • Katz et al. (2014) Daniel Martin Katz, Michael J. Bommarito II, and Josh Blackman. 2014. Predicting the behavior of the Supreme Court of the United States: A general approach. CoRR abs/1407.6333.
  • Malmasi et al. (2016a) Shervin Malmasi, Mark Dras, and Marcos Zampieri. 2016a. LTG at SemEval-2016 task 11: Complex Word Identification with Classifier Ensembles. In Proceedings of SemEval.
  • Malmasi and Zampieri (2017) Shervin Malmasi and Marcos Zampieri. 2017. German dialect identification in interview transcriptions. In Proceedings of the VarDial Workshop.
  • Malmasi and Zampieri (2018) Shervin Malmasi and Marcos Zampieri. 2018. Challenges in Discriminating Profanity from Hate Speech. Journal of Experimental & Theoretical Artificial Intelligence.
  • Malmasi et al. (2016b) Shervin Malmasi, Marcos Zampieri, and Mark Dras. 2016b. Predicting Post Severity in Mental Health Forums. In Proceedings of CLPsych Workshop.
  • Medvedeva et al. (2018) Masha Medvedeva, Michel Vols, and Martijn Wieling. 2018. Judicial Decisions of the European Court of Human Rights: Looking into the Crystal Ball. In Proceedings of the Conference on Empirical Legal Studies.
  • Molla and Seneviratne (2018) Diego Molla and Dilesha Seneviratne. 2018. Overview of the 2018 ALTA Shared Task: Classifying Patent Applications. In Proceedings of ALTA.
  • Sulea et al. (2017a) Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P Dinu, and Josef van Genabith. 2017a. Exploring the use of text classification in the legal domain. arXiv preprint arXiv:1710.09306.
  • Sulea et al. (2017b) Octavia-Maria Sulea, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2017b. Predicting the Law Area and Decisions of French Supreme Court Cases. In Proceedings of RANLP.
  • Tikk and Biró (2003) Domonkos Tikk and György Biró. 2003. Experiment with a hierarchical text categorization method on the WIPO-alpha patent collection. In Proceedings of ISUMA 2003.
  • Tsochantaridis et al. (2004) Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the ICML.
  • Wongchaisuwat et al. (2016) Papis Wongchaisuwat, Diego Klabjan, and John O McGinnis. 2016. Predicting Litigation Likelihood and Time to Litigation for Patents. arXiv preprint arXiv:1603.07394.
  • Zampieri et al. (2018) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, et al. 2018. Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign. In Proceedings of VarDial Workshop.
  • Zhong et al. (2018) Haoxi Zhong, Chaojun Xiao, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, et al. 2018. Overview of CAIL2018: Legal Judgment Prediction Competition. arXiv preprint arXiv:1810.05851.