Legal Transformer Models May Not Always Help

09/14/2021 ∙ by Saibo Geng, et al. ∙ EPFL

Deep-learning-based Natural Language Processing (NLP) methods, especially transformers, have achieved impressive performance in the last few years. Applying these state-of-the-art NLP methods to legal activities, to automate or simplify routine work, is of great value. This work investigates the value of domain-adaptive pre-training and language adapters in legal NLP tasks. By comparing the performance of language models with domain-adaptive pre-training on different tasks and different dataset splits, we show that domain-adaptive pre-training is only helpful with low-resource downstream tasks and is thus far from being a panacea. We also benchmark the performance of adapters in a typical legal NLP task and show that they can yield performance similar to full model tuning at a much smaller training cost. As an additional result, we release LegalRoBERTa, a RoBERTa model further pre-trained on legal corpora.


1 Introduction

The adoption of natural language processing in the legal domain has a long history. The earliest systems for searching online legal content appeared in the 1960s and 1970s, and legal expert systems were a hot topic of discussion in the 1970s and 1980s Dale (2019). NLP has been applied to automate activities in various legal areas, including natural-language-understanding-based contract review and question-answering-based legal advice Dale (2019). Various natural language tasks can be adapted to the legal domain, including

  • Question answering

  • Argument detection and definition extraction

  • Semantic annotation

  • Classification

In 2018, the release of BERT Devlin et al. (2019) as a pre-trained language representation set new state-of-the-art results on various NLP tasks. Later, domain-adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT) Gururangan et al. (2020) were shown to further improve results on top of a pre-trained model. Based on this idea, researchers have conducted DAPT with legal corpora; Legal-BERT from Chalkidis et al. (2020) is a good example. Liu et al. (2019) argued that BERT is largely under-trained and proposed new key hyper-parameters and a new training scheme to produce RoBERTa, which significantly outperforms BERT on various tasks. Following the same idea as Legal-BERT, we believe a further domain adaptation of RoBERTa should produce a new state-of-the-art model in the legal domain. We present LegalRoBERTa, a RoBERTa model further pre-trained on legal corpora.

However, unsupervised pre-training is costly and may take several days even with good computing resources. Chalkidis et al. (2020) showed that, compared to BERT, Legal-BERT achieves only slightly better performance on three different legal NLP tasks:

  • Multi-label classification on the EURLEX57K dataset Chalkidis et al. (2019)

  • Multi-label classification on the ECHR-CASES dataset Chalkidis et al. (2019)

  • Named entity recognition on the CONTRACTS-NER dataset Chalkidis et al. (2021)

Micheli et al. (2020) reported that corpus-specific MLM pre-training was not beneficial on a French question answering (FQuAD) task. If the improvement from DAPT over the original model is not apparent, it is doubtful whether researchers should spend time collecting data and conducting DAPT every time a new language model becomes available. We tested popular open-source legal language models, including our own LegalRoBERTa, to see whether a significant improvement could be observed. As we will show in Section 5, both Legal-BERT and LegalRoBERTa demonstrated only a limited boost over the original language models in the legal text classification task.

Wang et al. (2020) pointed out that the improvement brought by pre-training is related to the data size of the downstream task: the improvement is more remarkable when the downstream task is low-resource, and vice versa. As DAPT is also a particular type of pre-training, we believe the benefit of DAPT may follow a similar relationship with the downstream task. To test this hypothesis, we evaluated legal language models on two different tasks, one rich-resource and one low-resource, as well as on the same task with different amounts of training data. The results show that our hypothesis holds and that DAPT is especially beneficial when the downstream task suffers from a lack of data.

Finally, we investigate the performance of adapters, a recent NLP technique, in legal NLP tasks. Adapters are a more parameter-efficient way to fine-tune pre-trained language models for downstream tasks: only a small number of parameters are updated, and the per-task artifacts take far less space on disk. Our experimental results show that adapters produce performance comparable to fine-tuning the full model.

2 Contributions

The contributions of this paper are:

  1. Inspired by the idea of Legal-BERT and the success of RoBERTa, we present LegalRoBERTa, a domain-adapted language representation for the legal domain. It was pre-trained on a smaller legal corpus than Legal-BERT but produced similar performance.

  2. We adapted an existing legal summarization dataset Galgani and Hoffmann (2010) to a legal text retrieval task.

  3. We demonstrated that current open-source legal language models bring only marginal benefit, or no improvement, on a rich-resource NLP task. On the other hand, we showed that DAPT is beneficial when the downstream task is low-resource.

  4. We tested the performance of adapters Houlsby et al. (2019) on a legal text classification task and showed that they produce results comparable to fine-tuning the full model.

3 Related Work

Chalkidis et al. (2020) introduced Legal-BERT as the first transformer adapted to the legal domain. Our work fills the gap between BERT and RoBERTa in the legal domain and investigates whether DAPT is truly helpful in legal NLP tasks. Zheng et al. (2021) reported a similar conclusion on the relation between DAPT and downstream tasks. Their work was carried out concurrently with ours, and we were not aware of their results until our work was mostly finished.

4 Language Model Pre-training: LegalRoBERTa

4.1 Legal Corpora Description

Corpus Size (raw) Size (clean)
Patent Litigations 1.57 GB 1.1 GB
Caselaw Access Project 5.6 GB 2.8 GB
Google Patents Public Data 1.1 GB 1.0 GB
Total 8.3 GB 4.9 GB
Table 1: Pre-training corpora

As the first step to build a legal language model, we tried to collect public law-related corpora, but there were minimal available resources. As legal documents could contain sensitive information, institutes usually only release a small part of the data to the public, and a portion of them are in PDF format. Finally, we obtained around 5 GB of clean legal text data to proceed with the domain-adaptive pre-training.

4.2 Comparison with Other Corpora

Domain Corpus # Tokens Size (GB)
BIOMED S2ORC 7.55B 47
CS S2ORC 8.10B 48
NEWS REALNEWS 6.66B 39
REVIEW AMAZON reviews 2.11B 11
LEGAL LEGAL-BERT - 12
LEGAL LEGAL-ROBERTA 1.01B 4.9
Table 2: Corpora used for domain-adaptive pre-training in various domains

Compared to the corpora used in other domain-adaptive pre-training experiments, our legal corpus is significantly smaller.

4.3 Pre-training Details

Following Devlin et al. (2019), we ran additional pre-training steps of RoBERTa-BASE on domain-specific corpora. While Devlin et al. (2019) suggested up to 100k additional steps, our pre-training runs for 446k steps, as Chalkidis et al. (2020) suggest that prolonged in-domain pre-training has a positive effect on subsequent fine-tuning on downstream tasks.

Pre-training configuration:

  • learning rate = 5e-5 (with learning-rate decay, ending at 4.95e-8)

  • number of epochs = 3

  • total steps = 446K

  • total FLOPs = 2.7365e18

  • device: 2 × GeForce GTX TITAN X (compute capability 5.2)

  • runtime: 101 hours

  • per-GPU batch size = 2

The training loss starts at 1.850 and ends at 0.880. The perplexity on the legal corpus after domain-adaptive pre-training is 2.2735.

However, given the limited GPU memory available, our batch size is significantly smaller than those used in similar domain-adaptive pre-training efforts. We therefore think that further pre-training on the legal corpora should be considered and may be beneficial, cf. the table in Appendix 10.1. RoBERTa-BASE has been pre-trained for significantly more steps (1M) on generic corpora (e.g., Wikipedia, Children's Books) and is thus highly skewed towards generic language Chalkidis et al. (2020). In Appendix 10.2, we give two concrete examples demonstrating how differently LegalRoBERTa behaves from RoBERTa and Legal-BERT Chalkidis et al. (2020) on a masked-token prediction task.
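For reference, the sketch below shows how this kind of continued masked-language-model pre-training can be set up with the HuggingFace transformers Trainer. The corpus file name and the 512-token truncation are illustrative assumptions; the learning rate, number of epochs, and per-GPU batch size mirror the configuration listed above.

```python
# Minimal sketch of domain-adaptive pre-training (continued MLM training)
# of RoBERTa-base on a legal corpus, assuming a plain-text file
# `legal_corpus.txt` with one document per line.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

raw = load_dataset("text", data_files={"train": "legal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, as in the original RoBERTa recipe.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legal-roberta-base",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=train_set,
        data_collator=collator).train()
```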

5 Downstream Legal NLP Tasks and Model Testing

Model Authors HuggingFace model ID
Legal-BERT Chalkidis et al. (2020) nlpaueb/legal-bert-base-uncased
LegalRoBERTa Our paper saibo/legal-roberta-base
LegalBERT Zheng et al. (2021) zlucia/legalbert
RoBERTa-base Liu et al. (2019) roberta-base
BERT-base-uncased Devlin et al. (2019) bert-base-uncased
random-RoBERTa Our paper saibo/random-roberta-base
Table 3: Tested language models and their HuggingFace model IDs
Model Precision Recall F1 R@5 P@5 RP@5 NDCG@5
Legal-BERT 0.86 0.63 0.73 0.72 0.69 0.79 0.82
LegalRoBERTa 0.84 0.63 0.72 0.70 0.67 0.78 0.80
LegalBERT 0.86 0.61 0.71 0.71 0.68 0.78 0.81
RoBERTa-base 0.85 0.65 0.74 0.72 0.69 0.79 0.82
BERT-base-uncased 0.86 0.62 0.72 0.72 0.69 0.79 0.82
random-RoBERTa 0.85 0.59 0.69 0.69 0.66 0.76 0.79
Table 4: Results on Large-Scale Multi-Label Text Classification on EU Legislation

To investigate whether legal language models outperform general-purpose language models, we selected six open-source language models available on HuggingFace: two adapted BERT models from different authors, Legal-BERT from Chalkidis et al. (2020) and LegalBERT from Zheng et al. (2021); our own adapted RoBERTa model; the original BERT and RoBERTa models; and, finally, a randomly initialized RoBERTa model for comparison. All six models are available via the HuggingFace API; their model IDs are listed in Table 3.

We evaluated these models on text classification and information retrieval using two different datasets. EURLEX57K Chalkidis et al. (2019) is a large-scale multi-label text classification (LMTC) dataset of EU laws. The Legal Case Reports Data Set Dua and Graff (2017) contains Australian legal cases from the Federal Court of Australia (FCA) from 2006 to 2009 and was built for experiments with automatic summarization and citation analysis.

Split Documents (D) Words/D Labels/D
Train 45k 729 5
Dev 6k 714 5
Test 6k 725 5
Total 57k 727 5
Table 5: Statistics of the EUR-LEX dataset

5.1 Large-Scale Multi-Label Text Classification on EU Legislation (rich-resource)

Experimental setup

Our dataset for this task contains 57K legislative documents from EUR-LEX, annotated with around 4.3K labels. Each document can carry more than one label, so this is a multi-label classification task. The 4,271 labels are divided into frequent (746 labels), few-shot (3,362), and zero-shot (163), depending on whether they were assigned to more than 50, fewer than 50 but at least one, or no training documents, respectively. The model is composed of a language model as the encoder and an extra classification layer on top. We use binary cross-entropy as the loss function for this task. The metrics we use are identical to those in Chalkidis et al. (2019).
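As a minimal sketch of this setup, the snippet below implements an encoder plus a multi-label classification head trained with binary cross-entropy. The encoder name comes from Table 3; the pooling choice ([CLS] token) and the dropout rate are illustrative assumptions rather than our exact implementation.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

NUM_LABELS = 4271  # number of EUROVOC labels in EURLEX57K (Section 5.1)

class LMTCClassifier(nn.Module):
    """Pre-trained encoder + linear multi-label classification head."""
    def __init__(self, encoder_name="nlpaueb/legal-bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]       # [CLS]-token representation
        return self.classifier(self.dropout(pooled))

# One illustrative training step: targets are multi-hot vectors over labels.
model = LMTCClassifier()
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
batch = tokenizer(["Regulation on the common organisation of markets"],
                  return_tensors="pt", truncation=True, padding=True)
targets = torch.zeros(1, NUM_LABELS)               # multi-hot label vector
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.BCEWithLogitsLoss()(logits, targets)     # binary cross-entropy
loss.backward()
```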

Train data ratio Train samples Model Precision Recall F1 R@5 P@5 RP@5 NDCG@5
100% 45000 BERT 0.86 0.62 0.72 0.72 0.69 0.79 0.82
DAPT improvement +0.00 +0.00 +0.01 +0.00 +0.00 +0.00 +0.00
relative improvement (%) 0.0 0.0 +1.4 0.0 0.0 0.0 0.0
20% 9000 BERT 0.66 0.19 0.29 0.39 0.35 0.43 0.46
DAPT improvement +0.04 +0.00 +0.00 +0.01 +0.00 +0.00 +0.00
relative improvement (%) +6.1 0.0 0.0 +2.8 0.0 0.0 0.0
10% 4500 BERT 0.58 0.09 0.15 0.30 0.27 0.33 0.35
DAPT improvement +0.06 +0.02 +0.03 +0.02 +0.01 +0.01 +0.02
relative improvement (%) +10.3 +22 +20 +6.7 +3.7 +3.0 +5.7
5% 2250 BERT 0.49 0.06 0.11 0.22 0.20 0.24 0.26
DAPT improvement +0.00 +0.01 +0.02 +0.02 +0.02 +0.02 +0.02
relative improvement (%) 0.0 +17 +18 +9.1 +10 +8.3 +7.7
1% 450 BERT 0.00 0.00 0.00 0.03 0.02 0.03 0.02
DAPT improvement +0.00 +0.00 +0.00 +0.01 +0.01 +0.01 +0.02
relative improvement (%) 0.0 0.0 0.0 +33 +50 +33 +100
Table 6: Improvement on LMTC with different sizes of training data

Results

As shown in Table 4, only Legal-BERT from Chalkidis et al. (2020) slightly outperformed the original BERT. The adapted model LegalBERT from Zheng et al. (2021) fell slightly below the original BERT, as did LegalRoBERTa against the original RoBERTa. This slight margin could be due to statistical noise, because we only ran the experiments once with a fixed random seed. In any case, we cannot claim that the domain-pretrained models demonstrated significantly better performance than the general language models. Chalkidis et al. (2020) reported similar results on the same task: the difference between Legal-BERT and BERT was not significant. One can also notice that the difference between pre-trained RoBERTa and randomly initialized RoBERTa is relatively small. This is coherent with the argument of Wang et al. (2020): pre-training (or transfer learning in general) is only beneficial when there is not enough data for the downstream task.

5.2 Automatic Catchphrase Retrieval from Legal Court Case Documents (low-resource)

Split Cases
Train 2807
Dev 350
Test 350
Total 3507
Table 7: Statistics of the AUS-CASE dataset

Experimental setup

The dataset Galgani and Hoffmann (2010) contains 3,886 Australian legal cases from the Federal Court of Australia (FCA). The cases are annotated with catchphrases, citation sentences, citation catchphrases, and citation classes. The content that interests us most is the case sentences and the catchphrases. Originally, the catchphrases were used as the gold standard for summarization experiments. We decided to adapt this dataset to a text retrieval task because BERT, as a bidirectional model, is not well suited for text generation tasks such as summarization. Our goal here is to retrieve the correct catchphrases based on the description of a case (its sentences).

We use a supervised approach to rank and retrieve catchphrases from court case documents. Our method is inspired by VSE++: Improving Visual-Semantic Embeddings with Hard Negatives by Faghri et al. (2018). In that work, image-caption retrieval is conducted by ranking image-caption embedding similarities, and the model is optimized by minimizing a triplet ranking loss. We applied the same idea to our catchphrase retrieval task, replacing images with case descriptions and captions with catchphrases. An essential difference from the previous task is the volume of training data: comparing Table 7 and Table 5, the retrieval task has only about 6% of the training data of the text classification task (2,807 vs. 45,000 training samples).

As shown in the figure in Appendix 10.3, the model in this task consists of an encoder (a pre-trained transformer in our case) that extracts features from both catchphrases and documents, plus a dense neural network that maps the features into a shared representation space. The loss function in this task is the triplet ranking loss. Moreover, since we have only one ground-truth catchphrase per case, our task is roughly five times harder than the popular MS-COCO image-captioning task, where each image has five ground-truth captions on average.
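The sketch below illustrates this architecture and a VSE++-style triplet ranking loss with in-batch hardest negatives. The projection size and margin are illustrative assumptions; this is not our exact implementation.

```python
import torch
from torch import nn
import torch.nn.functional as F
from transformers import AutoModel

class DualEncoder(nn.Module):
    """Shared encoder + dense projection into a common embedding space."""
    def __init__(self, encoder_name="bert-base-uncased", dim=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, dim)

    def embed(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # L2-normalize so the dot product below is a cosine similarity.
        return F.normalize(self.proj(out.last_hidden_state[:, 0]), dim=-1)

def triplet_ranking_loss(doc_emb, phrase_emb, margin=0.2):
    """Hardest-negative triplet loss over a batch of matching (doc, phrase) pairs."""
    sim = doc_emb @ phrase_emb.t()                 # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                  # similarity of the true pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Cost of ranking a wrong catchphrase above the true one, and vice versa.
    cost_phrase = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_doc = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_phrase.max(dim=1)[0].mean() + cost_doc.max(dim=0)[0].mean()
```

Because the embeddings are L2-normalized, the same dot-product similarity used in the loss is used to rank catchphrases at test time.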

The metrics we use in this task are Recall@K (R@K), the mean rank of the ground-truth label, and the median rank of the ground-truth label (lower is better for MeanRank and MedRank). Note that these metrics also depend on the size of the test database: for example, searching over a one-thousand-case test set is more challenging than searching over a one-hundred-case test set. In this project, we focus on a test set of 389 cases.
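A minimal sketch of how these metrics can be computed from a document-catchphrase similarity matrix (one ground-truth catchphrase per document, as in our setup) is shown below; it is illustrative rather than our exact evaluation script.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, MedRank and MeanRank from a similarity matrix where
    sim[i, j] = similarity(document_i, catchphrase_j) and the true
    catchphrase of document i is catchphrase i."""
    order = np.argsort(-sim, axis=1)                  # best match first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1
                      for i in range(sim.shape[0])])  # 1-based rank of truth
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedRank"] = float(np.median(ranks))
    metrics["MeanRank"] = float(np.mean(ranks))
    return metrics
```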

Model R@1 R@5 R@10 MedRank MeanRank
BERT 14.4 ±0.6 33.7 ±0.8 45.7 ±0.4 13.0 ±0.0 49.8 ±1.1
RoBERTa 13.6 ±0.6 33.1 ±0.8 44.7 ±0.9 14.0 ±0.0 55.5 ±2.9
Legal-BERT 16.8 ±1.0 37.7 ±0.5 49.9 ±0.3 10.6 ±0.6 46.9 ±1.3
LegalBERT 15.4 ±0.2 34.8 ±1.3 46.2 ±1.1 12.8 ±0.4 47.3 ±1.8
LegalRoBERTa 15.1 ±1.1 34.6 ±0.9 46.3 ±0.8 13.5 ±1.0 52.9 ±0.4
Table 8: Results on Automatic Catchphrase Retrieval from Legal Court Case Documents

Results

As shown in Table 8, all the domain-pretrained models outperform the corresponding original models. Legal-BERT from Chalkidis et al. (2020) is significantly better than the original BERT. This result suggests that domain-adapted models should be favored for low-resource tasks.

Model Precision Recall F1 R@5 P@5 RP@5 NDCG@5
Legal-BERT 0.86 0.63 0.73 0.72 0.69 0.79 0.82
Adapter (diff) +0.01 -0.03 -0.02 -0.01 -0.01 0.00 0.00
LegalRoBERTa 0.84 0.63 0.72 0.70 0.67 0.78 0.80
Adapter (diff) +0.02 -0.04 -0.02 0.00 0.00 0.00 +0.01
LegalBERT 0.86 0.61 0.71 0.71 0.68 0.78 0.81
Adapter (diff) -0.01 0.00 0.00 0.00 0.00 +0.01 +0.01
RoBERTa-base 0.85 0.65 0.74 0.72 0.69 0.79 0.82
Adapter (diff) +0.01 -0.06 -0.04 -0.01 -0.01 0.00 -0.01
Table 9: Performance of adapters on the LMTC task

6 Varying Downstream task data size

Based on the observations from the previous experiments, we further investigate the relation between domain pre-training and downstream-task data size. We vary the amount of training data for the LMTC task and compare the performance of BERT versus LegalBERT (see the table in Appendix 10.5).

The results in Table 6 show that the performance improvement from domain pre-training is closely related to the amount of downstream training data: the benefit is significant when training data is scarce and negligible when training data is sufficient.
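For illustration, one simple way to construct such reduced training sets, assuming the EURLEX57K training split is held as a list of examples (the exact sampling procedure is an assumption here), is a seeded random subsample:

```python
import random

def subsample(train_examples, ratio, seed=0):
    """Draw a fixed-seed random subset of the training split."""
    rng = random.Random(seed)
    k = int(len(train_examples) * ratio)
    return rng.sample(train_examples, k)

# 45,000 * 0.20 = 9,000 training samples, matching the 20% row of Table 6.
```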

7 Adapters in legal text classification

When there are multiple downstream tasks, it is time-inefficient to fine-tune the whole language model once per task, and it also requires a lot of disk space to store a full model per task. Adapters, proposed by Houlsby et al. (2019), are a good alternative to full model fine-tuning. Instead of updating all the parameters of the model, we insert so-called adapter modules into the model. During training, only the parameters of the adapter modules are updated while the language model itself remains unchanged. When training is finished, one only needs to save the adapter modules for each task, as sketched below.
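The sketch below shows a Houlsby-style bottleneck adapter and the freezing pattern that leaves only the adapter (and task-head) parameters trainable. The bottleneck size, the GELU activation, and the parameter-name convention are illustrative assumptions.

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, non-linearity, up-projection,
    wrapped in a residual connection so the layer starts close to identity."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def mark_trainable(model):
    """Freeze the backbone; train only adapter and classifier parameters.
    Assumes adapter modules are registered under names containing 'adapter'."""
    for name, param in model.named_parameters():
        param.requires_grad = ("adapter" in name) or ("classifier" in name)
```

Only the unfrozen parameters need to be stored per task, which is what keeps the per-task disk footprint small.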

This multi-task scenario is frequent in legal NLP because a single legal activity can be assisted by several relatively simple downstream tasks. We test the adapter module on various language models with and without domain pre-training. In terms of performance, adapters produce results very close to fine-tuning the whole model, cf. Table 9. However, training adapters is not faster than fine-tuning the full model, because the forward pass and back-propagation still require roughly the same amount of computation as fine-tuning the whole model.

8 Limitations and Future Work

The pre-training of LegalRoBERTa was restricted by the size of the available legal corpora. To better exploit the potential of RoBERTa, we should consider collecting more data, e.g., by automatic scraping. In the meantime, Chalkidis et al. (2020) have released other legal corpora of UK and EU legislation totaling roughly 2.5 GB; it should be beneficial to include those data in the pre-training of LegalRoBERTa v2. Furthermore, the number of pre-training steps seems insufficient compared with related work (see Appendix 10.1). In the Large-Scale Multi-Label Text Classification on EU Legislation task, we evaluated all models with identical hyper-parameters due to limited time and computing resources. A grid search over hyper-parameters and repeating the experiments with different random seeds could be considered.

In the legal case retrieval task, paired statistical tests could be conducted to determine whether the domain-pretrained models are significantly better than the original models.

9 Conclusion

In this work, we first tried to answer a critical question in legal NLP from an empirical perspective: when does domain pre-training help the model yield better performance? Through a series of legal NLP experiments, we showed that the three existing legal transformer models did not yield significant improvement on a rich-resource task, while they did show considerable improvement on a low-resource task or when we deliberately cut down the training data size. We therefore recommend domain-pretrained language models only for low-resource tasks. The second part showed that adapters, as an emerging technique, are well suited for legal NLP tasks. As an intermediate result, we release LegalRoBERTa (https://huggingface.co/saibo/legal-roberta-base), a RoBERTa model adapted to the legal domain.

10 Acknowledgments

This work was done in cooperation with the Federal Department of Foreign Affairs (FDFA) of Switzerland and EPFL as a semester project.

References

  • Chalkidis et al. (2019) Chalkidis, I., Fergadiotis, E., Malakasiotis, P., and Androutsopoulos, I. (2019). Large-scale multi-label text classification on EU legislation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6314–6322, Florence, Italy. Association for Computational Linguistics.
  • Chalkidis et al. (2020) Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., and Androutsopoulos, I. (2020). Legal-bert: The muppets straight out of law school.
  • Chalkidis et al. (2021) Chalkidis, I., Fergadiotis, M., Malakasiotis, P., and Androutsopoulos, I. (2021). Neural contract element extraction revisited: Letters from sesame street.
  • Dale (2019) Dale, R. (2019). Law and word order: NLP in legal tech. Natural Language Engineering, 25(1):211–217.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding.
  • Dua and Graff (2017) Dua, D. and Graff, C. (2017). UCI machine learning repository.

  • Faghri et al. (2018) Faghri, F., Fleet, D. J., Kiros, J. R., and Fidler, S. (2018). Vse++: Improving visual-semantic embeddings with hard negatives.
  • Galgani and Hoffmann (2010) Galgani, F. and Hoffmann, A. (2010). Lexa: Towards automatic legal citation classification. In Li, J., editor, AI 2010: Advances in Artificial Intelligence, volume 6464 of Lecture Notes in Computer Science, pages 445–454. Springer Berlin Heidelberg.
  • Gururangan et al. (2020) Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., and Smith, N. A. (2020). Don’t stop pretraining: Adapt language models to domains and tasks.
  • Houlsby et al. (2019) Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. (2019). Parameter-efficient transfer learning for nlp.
  • Lee et al. (2019) Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2019). Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.
  • Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach.
  • Micheli et al. (2020) Micheli, V., d’Hoffschmidt, M., and Fleuret, F. (2020). On the importance of pre-training data volume for compact language models.
  • Wang et al. (2020) Wang, S., Khabsa, M., and Ma, H. (2020). To pretrain or not to pretrain: Examining the benefits of pretraining on resource rich tasks. CoRR, abs/2006.08671.
  • Zheng et al. (2021) Zheng, L., Guha, N., Anderson, B. R., Henderson, P., and Ho, D. E. (2021). When does pretraining help? assessing self-supervised learning for law and the casehold dataset. CoRR, abs/2104.08671.

Appendix

10.1 Domain Adaptive Pre-training Details of Related Work

The table below summarizes the domain-adaptive pre-training details of related work.

Experiment Authors Steps # Epochs Batch size
Various Domains Gururangan et al. (2020) 12.5K 1 256
Legal-BERT Chalkidis et al. (2020) 1000K 40 256
BioBERT Lee et al. (2019) 200K-470K - 192
LegalRoBERTa Our paper 446K 3 4

10.2 Examples of Masked-Token Prediction with LegalRoBERTa

This {mask} Agreement is between General Motors and John Murray .

Model Top1 Top2 Top3
BERT new current proposed
LegalBERT settlement letter dealer
LegalRoBERTa License Settlement Contract
Table 10: Masked token prediction, example 1

The applicant submitted that her husband was subjected to treatment amounting to {mask} whilst in the custody of Adana Security Directorate

Model Top1 Top2 Top3
BERT torture rape abuse
LegalBERT torture detention arrest
LegalRoBERTa torture abuse insanity
Table 11: Masked token prediction, example 2
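These examples can be reproduced with the HuggingFace fill-mask pipeline, as sketched below; the model ID comes from Table 3, and the snippet simply inserts the model-specific mask token (which differs between BERT and RoBERTa) into the template sentence.

```python
from transformers import pipeline

# Sketch: top-3 masked-token predictions for the first example above.
fill = pipeline("fill-mask", model="saibo/legal-roberta-base")
text = ("This {mask} Agreement is between General Motors and John Murray ."
        .format(mask=fill.tokenizer.mask_token))
for pred in fill(text, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```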

10.3 Model Architecture in Catchphrase Retrieval task

Below is an illustration of the model used for Automatic Catchphrase Retrieval from Legal Court Case Documents, as described in Section 5.2.

10.4 Hyper-parameters in LMTC task

  1. lr: 3e-05

  2. random seed: 0

  3. batch size: 16

  4. max sequence size: 216

  5. epochs: 40

  6. dropout: 0.1

  7. early stopping: yes

  8. patience: 7

10.5 Results on LMTC with different sizes of training data

The results on LMTC for BERT and LegalBERT with different sizes of training data are shown in the table below.

Training data ratio Train samples Model Precision Recall F1 R@5 P@5 RP@5 NDCG@5
100% 45000 BERT 0.86 0.62 0.72 0.72 0.69 0.79 0.82
LegalBERT 0.86 0.63 0.73 0.72 0.69 0.79 0.82
20% 9000 BERT 0.66 0.19 0.29 0.39 0.35 0.43 0.46
LegalBERT 0.70 0.19 0.29 0.40 0.35 0.43 0.46
10% 4500 BERT 0.58 0.09 0.15 0.30 0.27 0.33 0.35
LegalBERT 0.64 0.11 0.18 0.32 0.28 0.34 0.37
5% 2250 BERT 0.49 0.06 0.11 0.22 0.20 0.24 0.26
LegalBERT 0.49 0.07 0.13 0.24 0.22 0.26 0.28
1% 450 BERT 0.00 0.00 0.00 0.03 0.02 0.03 0.02
LegalBERT 0.00 0.00 0.00 0.04 0.03 0.04 0.04