EuroVoc (https://data.europa.eu/data/datasets/eurovoc) is a multilingual thesaurus that was originally built specifically for processing the documentary information of the EU institutions. The covered fields encompass both European Union and national points of view, with a certain emphasis on parliamentary activities. The current release of EuroVoc, 4.4, was published in December 2012 and includes 6,883 IDs for thesaurus concepts (corresponding to the preferred terms), classified into 21 top-level domains, further refined into 127 subdomains. Additional forms of the preferred terms are also available and are assigned the same ID, subdomains and top-level domains.
Multilingual EuroVoc thesaurus descriptors are used by a large number of European Parliaments and Documentation Centres to index their large document collections. The assigned descriptors are then used to search and retrieve documents in the collection and to summarise the document contents for the users. As EuroVoc descriptors exist in one-to-one translations in almost thirty languages, they can be displayed in a language other than the text language and give users cross-lingual access to the information contained in each document.
One of the most successful recent approaches to document and text classification involves fine-tuning large pretrained language models on a specific task adhikari2019docbert; nikolov2019nikolov. Thus, in this work we propose a tool for classifying legal documents with EuroVoc descriptors that uses various flavours of Bidirectional Encoder Representations from Transformers (BERT) devlin2019bert, specific to each language. We evaluated the performance of our models for each individual language and show that they obtain a significant improvement over a similar tool, JEX steinberger2012jrc. The models were further integrated into the RELATE platform pais2020, and an API was provided through the Python Package Index (PyPI) (https://pypi.org/project/pyeurovoc/) that facilitates the classification of new documents. The code used to train and evaluate the models was also open-sourced (https://github.com/racai-ai/pyeurovoc).
The rest of the paper is structured as follows. Section 2 presents other works in the direction of EuroVoc classification. Section 3 provides several statistics with regard to the corpus used to train and test the models. Section 4 presents the approach used in fine-tuning the pretrained language models and the exact BERT variant used for each language, together with vocabulary statistics of each model's tokenizer on the legal dataset. Section 5 outlines our evaluation setup and the results of our experiments, while Section 6 presents the programmatic interface. Finally, the paper is concluded in Section 7.
2 Related Work
JEX steinberger2012jrc is a multi-label classification software package developed by the Joint Research Centre (JRC) that was trained to assign EuroVoc descriptors to documents covering the activities of the EU. It is written entirely in Java and comes with 4 scripts (both batch and bash) that allow a user to pre-process a set of documents, train a model, post-process the results and evaluate a model. Each script is easily configurable from a properties file that contains most of the necessary parameters. The toolkit also comes with a graphical interface that allows a user to easily label a set of new documents (in plain text, XML or HTML) or to train a classifier on their own document collections.
The classification algorithm was described in pouliquen2006automatic and consists of producing, from normalized text, a list of lemma frequencies and their weights that are statistically related to each descriptor, called in the paper associates or topic signatures. To classify a new document, the algorithm then picks the descriptors of the associates that are most similar to the list of lemma frequencies of the new document. The initial release consisted of 22 pretrained classifiers, each corresponding to an official EU language.
Boella et al. boella2012multi, while focusing on the Italian JRCAcquis-IT corpus, present a technique for transforming multi-label data into mono-label data that retains all of the information, as in tsoumakas2007multi, allowing the use of approaches like Support Vector Machines (SVM) joachims1998text for classification. Their proposed method achieves an F1 score of 58.32 (an increase of almost 8% compared to the JEX score of 50.61 for the Italian language).
Šarić et al. vsaric2014multi further explore SVM approaches for the classification of Croatian legal documents and report an F1 score of 68.6. Unfortunately, this is not directly comparable with the reported JEX results, since the training corpus is a different collection, called NN13205. Furthermore, the categories used for the gold annotation represent an extended version of EuroVoc for Croatian, called CroVoc.
Studies such as collobert2011natural have shown that neural word embeddings can store abundant semantic meanings and capture multi-aspect relations in a real-valued matrix when trained on large unlabeled corpora using neural networks. Considering a vocabulary $V$, an embedding representation can be learned by means of a neural network, resulting in the association of a real-valued vector of size $d$ with each word. Two neural network methods for automatically learning distributed representations of words from a large text corpus can be considered: Skip-gram and continuous bag of words (CBOW) DBLP:journals/corr/abs-1301-3781. In the case of CBOW, a neural network is trained to predict the middle word given a context, while Skip-gram uses a single word as input and tries to predict past and future words. Bojanowski et al. bojanowski2017enriching introduced a method for the runtime representation of unknown words by means of averaging pre-trained character n-grams, also known as subword information.
BERT has also been used to classify legal documents with EuroVoc labels, with most of the work focusing on the English language. In chalkidis2019large, the authors studied the problem of Large-Scale Multi-Label Text Classification (LMTC) for few- and zero-shot learning and released a new dataset composed of 57k samples from EUR-LEX, on which several models were tested. The results showed that BERT obtained superior performance in all settings except zero-shot classification.
3 Dataset Statistics
The training of BERT models for the 22 languages was done using the same dataset that was used for training the JEX models. The dataset is composed of two parallel corpora from the legal domain, JRC-Acquis steinberger2006jrc and the Publications Office of the European Union (OPOCE), that were manually labeled with over 6,700 EuroVoc descriptor identifiers (IDs). The EuroVoc descriptors are hierarchically organised and can be converted into higher-level microthesaurus labels (MT) and further into top-level domain labels (DO).
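Since a four-digit microthesaurus code is prefixed by its two-digit domain code, the upward ID to MT to DO conversion can be sketched as follows (the ID to MT entries below are hypothetical; the real mapping table ships with the thesaurus):

```python
# Hypothetical excerpt of the ID -> MT mapping from the thesaurus.
ID_TO_MT = {"1309": "1216", "2771": "2006"}


def mt_to_do(mt_code: str) -> str:
    """A four-digit MT code is prefixed by its two-digit domain code."""
    return mt_code[:2]


def id_to_do(descriptor_id: str) -> str:
    """Convert an ID descriptor to its top-level domain via the MT level."""
    return mt_to_do(ID_TO_MT[descriptor_id])
```

This direct mapping is what allows MT and DO labels to be derived from predicted IDs instead of being predicted separately.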
The number of documents in the dataset ranges from 17,858 for Maltese to 41,989 for French. Each document is labeled with multiple ID descriptors, with an average of 6 ID descriptors per document, which can be equivalently converted to 5 MT descriptors or 4 DO descriptors. In Figure 1 we depict the distribution of the average number of ID, MT and DO descriptors per document, together with the difference between the minimum and the maximum number of documents per descriptor.
The ID, MT and DO descriptor distributions are also highly unbalanced. Figure 2 depicts the number of documents of the most frequent ID, MT and DO descriptors, organised in groups of 50, 5 and 1, respectively. Each group contains the sum of the numbers of documents labeled with each descriptor in the respective group. As can be observed in each subplot, the number of documents that contain the descriptors from the first few groups is higher than the number of documents that contain all the other descriptors.
4 Approach
The proposed approach for classifying the legal documents found in the two corpora is to fine-tune a pre-trained BERT model for each of the 22 languages. We follow the method introduced in devlin2019bert: a simple feed-forward network with weights $W \in \mathbb{R}^{C \times H}$, where $H$ is the embedding size of BERT and $C$ is the number of classes, and bias $b \in \mathbb{R}^{C}$ is put on top of the embedding $h$ of the first token ([CLS]) to create the output logits $z = Wh + b$ of the classification problem for the ID descriptors (the MT and DO descriptors are predicted by using a direct mapping scheme). The sigmoid function is then applied to produce independent probability distributions over each class: $p = \sigma(z)$.
Additionally, a dropout of 0.1 is applied on the feed-forward layer to regularize the model.
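As a plain-Python sketch of this head (illustrative dimensions; a real implementation would use a deep-learning framework and include the 0.1 dropout):

```python
import math


def classification_head(h_cls, W, b):
    """Multi-label head sketch: logits z = W.h + b, then an element-wise
    sigmoid gives an independent probability per ID descriptor."""
    z = [sum(w * h for w, h in zip(row, h_cls)) + b_c
         for row, b_c in zip(W, b)]
    return [1.0 / (1.0 + math.exp(-z_c)) for z_c in z]


# Toy example: embedding size 2, two classes.
probs = classification_head([1.0, -1.0], [[0.5, 0.5], [0.0, 0.0]], [0.0, 0.0])
```

Because each class gets its own sigmoid rather than a shared softmax, any number of descriptors can fire for the same document, which is what multi-label classification requires.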
The models are optimized by reducing the loss $\mathcal{L}$, computed as the average binary cross-entropy between the output probabilities $p$ and the target classes $y$, over the $C$ classes (ID descriptors):

$\mathcal{L} = -\frac{1}{C} \sum_{i=1}^{C} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$
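The loss can be sketched in plain Python as follows (the eps guard against log(0) is an implementation detail, not from the paper):

```python
import math


def bce_loss(p, y, eps=1e-7):
    """Average binary cross-entropy over the C ID-descriptor classes."""
    return -sum(y_i * math.log(p_i + eps) + (1 - y_i) * math.log(1 - p_i + eps)
                for p_i, y_i in zip(p, y)) / len(p)
```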
Because the available flavours of BERT vary from one language to another, the choice of the initial model for each language was made using the following heuristic, based on the corpora used for pretraining: Legal > Monolingual (Mono) > Wikipedia (Wiki) > Multilingual (Multi). The heuristic is experimentally supported by chalkidis2020legal, who showed that language models obtain superior performance on the legal domain when they are pretrained on legal corpora, and by pyysalo2020wikibert, who outlined the superiority of BERTs pretrained on monolingual Wikipedia over multilingual BERT (mBERT). Also, it was empirically shown that the performance of language models improves when they are pretrained on larger corpora liu2019roberta, and for this reason we expect most of the general monolingual models to obtain better results than the Wikipedia BERTs. Thus, given the existing open-sourced models for each language, we use the following taxonomy in our experiments (Figure 3; to the best of our knowledge, not all languages have publications for their monolingual versions of BERT, so we attach a corresponding URL in these cases):
Legal: English (en) - Legal BERT chalkidis2020legal.
Mono: Danish (da) - Danish BERT (https://github.com/botxo/nordic_bert), German (de) - German BERT (https://huggingface.co/bert-base-german-cased), Greek (el) - Greek BERT koutsikakis2020greek, Spanish (es) - Spanish BERT canete2020spanish, Estonian (et) - EstBERT tanvir2020estbert, Finnish (fi) - Finnish BERT virtanen2019multilingual, French (fr) - CamemBERT martin2020camembert, Hungarian (hu) - huBERT erzsebetdavid, Italian (it) - Italian BERT (https://huggingface.co/dbmdz/bert-base-italian-cased), Dutch (nl) - BERTje de2019bertje, Polish (pl) - PolBERT Kleczek2020, Portuguese (pt) - BERTimbau souza2020bertimbau, Romanian (ro) - Romanian BERT dumitrescu2020birth, Swedish (sv) - Swedish BERT swedish-bert.
Wiki: Bulgarian (bg) - WikiBERT-BG, Czech (cs) - WikiBERT-CS, Lithuanian (lt) - WikiBERT-LT, Latvian (lv) - WikiBERT-LV, Slovak (sk) - WikiBERT-SK, Slovene (sl) - WikiBERT-SL.
Multi: Maltese (mt) - mBERT (https://huggingface.co/bert-base-multilingual-cased).
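To illustrate the heuristic, a sketch of the selection logic is given below; the HuggingFace identifiers for Legal BERT and WikiBERT-BG are assumptions (only the German, Italian and multilingual identifiers appear verbatim above):

```python
# Tiered model choice: Legal > Mono > Wiki > Multi (mBERT fallback).
LEGAL = {"en": "nlpaueb/legal-bert-base-uncased"}   # assumed HF id
MONO = {
    "de": "bert-base-german-cased",                  # id from the list above
    "it": "dbmdz/bert-base-italian-cased",           # id from the list above
}
WIKI = {"bg": "TurkuNLP/wikibert-base-bg-cased"}     # assumed HF id
FALLBACK = "bert-base-multilingual-cased"            # mBERT


def choose_model(lang):
    """Return the pretrained checkpoint for a language, preferring
    legal-domain, then monolingual, then Wikipedia-only models."""
    for tier in (LEGAL, MONO, WIKI):
        if lang in tier:
            return tier[lang]
    return FALLBACK
```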
|Model||Tokens/word||UNK tokens/word|
|Danish BERT (da)||1.51||6e-3|
|German BERT (de)||1.64||1e-3|
|Greek BERT (el)||1.44||8e-5|
|Legal BERT (en)||1.28||3e-4|
|Spanish BERT (es)||1.25||6e-3|
|Finnish BERT (fi)||1.72||1e-3|
|Italian BERT (it)||1.36||2e-4|
|Romanian BERT (ro)||2.31||1e-4|
|Swedish BERT (sv)||1.45||5e-4|
The vocabulary of BERT plays an important role in the final performance of the model. Broadly speaking, the fewer tokens each word is split into, the better the language model is expected to perform. In Table 1 we depict the average number of tokens per word and the average number of unknown (UNK) tokens per word on the legal dataset for the tokenizer of each of the 22 BERT models. As can be observed, the lowest number of tokens per word is achieved by the Spanish BERT, with 1.25, followed closely by the Legal BERT, with 1.28. When looking at unknown tokens per word, the CamemBERT tokenizer leads, with no unknown words when tokenizing the dataset. On the other hand, the highest numbers of tokens and unknown tokens per word were obtained on Maltese, due to the use of mBERT instead of a monolingual model.
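The Table 1 statistics can be computed as in the sketch below, given any function that splits a word into subword tokens (the toy tokenizer is purely illustrative):

```python
def avg_tokens_per_word(tokenize, words):
    """Average number of subword tokens per word over a corpus sample."""
    return sum(len(tokenize(w)) for w in words) / len(words)


def avg_unk_per_word(tokenize, words, unk="[UNK]"):
    """Average number of unknown tokens per word."""
    return sum(tokenize(w).count(unk) for w in words) / len(words)


# Toy tokenizer: splits words into chunks of at most 4 characters.
toy = lambda w: [w[i:i + 4] for i in range(0, len(w), 4)]
ratio = avg_tokens_per_word(toy, ["regulation", "act"])
```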
The legal documents in the corpus can be rather long and exceed the maximum of 512 tokens allowed by the BERT models. To mitigate this, we simply keep only the first 512 tokens of the document and discard the rest. This method has been shown to lead to approximately the same performance as considering the whole document chalkidis2019large.
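A minimal sketch of this truncation follows; preserving a trailing [SEP] id is our own assumption, since the text only states that tokens past the limit are discarded:

```python
def truncate(input_ids, max_len=512):
    """Keep only the first max_len token ids of a document. If the
    sequence is longer, keep its final id (assumed to be [SEP]) so the
    sequence still terminates correctly."""
    if len(input_ids) <= max_len:
        return input_ids
    return input_ids[:max_len - 1] + [input_ids[-1]]
```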
|Language||JEX ID F1@6||JEX MT F1@5||JEX DO F1@4||BERT ID F1@6||BERT MT F1@5||BERT DO F1@4|
5 Evaluation
5.1 Evaluation Setup
Because the original splits used for training and evaluating the JEX models were not made publicly available, we united the JRC-Acquis and OPOCE datasets for each language and split the result 5 times into train, validation and test sets using different seeds. Moreover, in order to preserve the class balance across the sets of each split, we employed the iterative stratification splitting approach proposed in sechidis2011stratification and kept an approximate ratio of 80% train, 10% validation and 10% test for fine-tuning and evaluating the pre-trained language models, and a ratio of 90% train and 10% test for training and evaluating the JEX models.
The pre-trained language models were fine-tuned for 30 epochs, using a batch size of 8 and the AdamW optimizer loshchilov2018decoupled, whose learning rate was decayed by a linear scheduler peaking at 6e-5 in order to reduce oscillations in the later stages of training caused by high learning rate values. We also clipped gradients pascanu2013difficulty whose norm exceeded 5 and used a learning rate warm-up over the first epoch to alleviate the effects of forgetting the knowledge learned by Transformer models in the pre-training phase. The final weights of each fine-tuned language model were those that obtained the lowest loss on the validation set during training.
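The warm-up plus linear decay schedule can be sketched as follows (counts are in optimizer steps, with the warm-up span covering the first epoch; the exact scheduler implementation in the original code may differ):

```python
def lr_at_step(step, total_steps, warmup_steps, peak_lr=6e-5):
    """Linear warm-up to peak_lr over warmup_steps, then linear decay
    to zero by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```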
The training and evaluation of the JEX models followed the approach described in steinberger2012jrc. Both the JEX models and the pre-trained language models were trained five times, once on each split, with the results averaged over all test splits. We also used the validation splits for early stopping and to fine-tune the hyperparameters of the BERT models.
5.2 Evaluation Metrics
The most used metrics for evaluating LMTC models are the precision (P@K), the recall (R@K) and their harmonic mean, known as the F1 score (F1@K), over the top $K$ predicted labels. These metrics usually unfairly penalize documents that have fewer or more labels than $K$, but we still use them because they allow a direct comparison with the original results of JEX. The three metrics are defined as follows:

$P@K = \frac{1}{K} \sum_{k=1}^{K} y_{r(k)}$

$R@K = \frac{1}{n} \sum_{k=1}^{K} y_{r(k)}$

$F1@K = \frac{2 \cdot P@K \cdot R@K}{P@K + R@K}$

where $K$ is the number of labels to be used for comparison, $n$ is the number of true labels of the respective document, $y$ is the vector of the true labels, $\hat{y}$ is the vector of predicted labels and $r(k)$ is a function that selects the index of the $k$-th largest value in $\hat{y}$.
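A direct plain-Python implementation of these metrics is sketched below (in practice the scores would be the sigmoid outputs of the classifier):

```python
def metrics_at_k(y_true, y_score, k):
    """Compute P@K, R@K and F1@K over the top-k scored labels.

    y_true is a binary vector of gold labels; y_score holds the
    predicted per-class probabilities."""
    ranked = sorted(range(len(y_score)), key=lambda i: y_score[i],
                    reverse=True)
    hits = sum(y_true[i] for i in ranked[:k])
    n_true = sum(y_true)
    p = hits / k
    r = hits / n_true if n_true else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```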
As the statistics in Section 3 have shown, the average number, per document, of ID descriptors is 6, of MT descriptors is 5 and of DO descriptors is 4. Thus, we evaluate both JEX and BERT using the F1 score over 6 labels on ID descriptors (F1@6), over 5 labels on MT descriptors (F1@5) and over 4 labels on DO descriptors (F1@4).
5.3 Results
The results for both JEX and the BERT models on the 22 languages, using the cross-validated dataset, are outlined in Table 2. The BERT models obtained a significant improvement over JEX for each language, ranging from an enhancement of 21.54% (el), 14.85% (fr) and 9.65% (el) to an enhancement of 37.06% (sk), 25.94% (ro) and 19.83% (sl) for the ID, MT and DO descriptors, respectively. The highest F1 scores with JEX were achieved on German, with 50.65% F1@6, 63.15% F1@5 and 72.23% F1@4, and the highest F1 scores with the BERT models were achieved on Slovene, with 84.90% F1@6, 87.37% F1@5 and 91.72% F1@4 for ID, MT and DO descriptors, respectively. On the other hand, the lowest scores were obtained on Maltese. This might be due to the low number of documents compared to the other languages steinberger2012jrc, but also because, in the case of the BERT variant, we use a multilingual model instead of a monolingual one.
Figure 4 depicts the F1@6 scores obtained by the BERT models on multi-label ID classification relative to the scores obtained by the JEX models for the same language. One interesting aspect that can be observed in the plot is that, although the mBERT used for Maltese obtained the lowest F1 score, its relative improvement over JEX is higher than that of three languages that use monolingual models (Greek, Hungarian and French), mostly because the F1@6 score obtained by JEX on Maltese is lower than the scores obtained on those three languages, which makes the relative improvement larger.
The variance of scores between languages is higher for the BERT models (high-low differences: ID 15.84%, MT 15.22%, DO 10.26%) than for the JEX models (high-low differences: ID 6.29%, MT 7.16%, DO 6.12%). This happens because the BERT models were pretrained on various corpora taken from different sources, and aspects like the quality of the corpus or the domain match greatly influence the resulting fine-tuning performance. One interesting result is that the WikiBERT models obtained some of the highest scores and that Legal BERT did not perform as well as expected, thus partially contradicting the heuristic introduced in the previous section. Due to time and resource constraints, we leave a detailed study of the heuristic for future work.
5.4 Comparison with State-of-the-Art
The state-of-the-art (SOTA) for EuroVoc multi-label classification was presented in chalkidis2019large, using the original BERT-base. The model was trained and evaluated on EUR-LEX, an English corpus introduced in the same paper. We evaluated our English model (Legal BERT), trained on JRC-Acquis and OPOCE, and report in Table 3 the R-Precision (RP), the normalized discounted cumulative gain (nDCG) and the micro-F1 score for extracting 5 ID descriptors on the test set of EUR-LEX. Our model obtained an RP@5 of 81.2%, an nDCG@5 of 83.4% and a micro-F1 of 79.6%, outperforming CNN-LWAN, BIGRU-LWAN mullenbach2018explainable, BIGRU-LWAN-L2V chalkidis2019large and BERT-base. It must be noted that this comparison is not entirely fair, because our model was trained on a different corpus, which might affect the final results. However, it offers a glimpse of the performance of our system in contrast with more modern approaches.
|Model||RP@5||nDCG@5||micro-F1|
|Legal BERT (ours)||81.2||83.4||79.6|
Other extensive document classification experiments with BERT were conducted by adhikari2019docbert, without specific consideration of EuroVoc labels. They used an approach similar to ours, introducing a fully-connected layer over the embedding of the first token, [CLS]. Furthermore, the paper also presents the results of a knowledge distillation process hinton2015distilling from the fine-tuned BERT-large into the previous SOTA adhikari2019rethinking, a much smaller network, obtaining better results than BERT-base on the evaluated datasets, but still behind BERT-large.
5.5 Response Time
The response time of the API was tested on a CPU (Intel Xeon Silver 4210) and on a GPU (Nvidia Quadro RTX 5000). Because the pretrained language models mostly have the same dimensions, we made an inference time analysis only for the English variant. Figure 5 depicts the average response time of Legal BERT for various sequence lengths. On the GPU, the response time is approximately 34 ms, with a slight increase to 43 ms when the maximum sequence length of 512 is given as input. However, when the API is run on the CPU, the response time increases from 100 ms to 450 ms.
6 Programmatic Interface
To ease the loading of models and the classification of documents, we created a programmatic interface in Python that can be installed from PyPI with the command pip install pyeurovoc. Once the library is installed, a BERT model is created simply by instantiating the class EuroVocBERT with one of the 22 languages. The class will either download the fine-tuned model from the repository or use a locally cached version of it. Finally, the classification of a document is performed by calling the instantiated model with the document text.
More detailed information about the API and how custom pre-trained BERT models can be fine-tuned on the dataset can be found at the source repository. An example of API usage is presented in Appendix A.
7 Conclusion and Future Work
Document classification remains a relevant problem in today's society, aiding companies and government institutions in indexing their large textual databases. This paper presented a tool for classifying legal documents with EuroVoc descriptors that uses various Transformer-based language models, fine-tuned on the 22 languages found in JRC-Acquis and OPOCE. We thoroughly evaluated the models on multiple splits of the data, and the results showed that they significantly improve on the performance obtained by a similar tool, JEX. The pretrained models were made publicly available and can easily be used to classify new documents via our API.
One direction for possible future work is to improve the inference speed of the models by either distilling their knowledge into a smaller network hinton2015distilling or quantizing their weights yang2019quantization. Furthermore, we intend to include our results for legal document classification in language-specific NLP benchmarks such as KLEJ for Polish rybak2020klej, LiRo for Romanian dumitrescu2021liro or EVALITA4ELG for Italian patti2020evalita4elg.
This research was supported by the EC grant no. INEA/CEF/ICT/A2017/1565710 for the Action no. 2017-EU-IA-0136 entitled “Multilingual Resources for CEF.AT in the legal domain” (MARCELL).
Appendix A API Code Snippet
The following is a code snippet that loads the BERT model for English from the checkpoint repository and classifies a document, given its text.
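A sketch of the snippet, reconstructed from the API description in Section 6, is given below; the exact call signature and the num_labels keyword are taken from the surrounding text and should be checked against the source repository:

```python
from pyeurovoc import EuroVocBERT

# Load the fine-tuned English model; the checkpoint is downloaded from
# the repository on first use and cached locally afterwards.
model = EuroVocBERT(lang="en")

document_text = "Commission Decision laying down detailed rules on ..."

# Classify the document; num_labels controls how many ID descriptors
# are returned together with their confidence scores.
predictions = model(document_text, num_labels=6)
```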
The code snippet will return a dictionary of ID descriptors and confidence scores. The number of labels returned by the model for the ID descriptor type is controlled by the num_labels parameter.