Data-Centric Machine Learning in the Legal Domain

01/17/2022
by Hannes Westermann, et al.

Machine learning research typically starts with a fixed data set created early in the process. The experiments then focus on finding a model and training procedure that yield the best possible performance on a selected evaluation metric. This paper explores how changes in a data set influence the measured performance of a model. Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact the performance of a trained deep learning classifier. We assess the overall performance (weighted average) as well as the per-class performance. The observed effects are surprisingly pronounced, especially when the per-class performance is considered. We investigate how the "semantic homogeneity" of a class, i.e., the proximity of its sentences in a semantic embedding space, influences the difficulty of its classification. The presented results have far-reaching implications for efforts related to data collection and curation in the field of AI & Law. The results also indicate that enhancements to a data set could be considered, alongside the advancement of the ML models, as an additional path for increasing classification performance on various tasks in AI & Law. Finally, we discuss the need for an established methodology to assess the potential effects of data set properties.
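The abstract describes "semantic homogeneity" only as the proximity of a class's sentences in an embedding space and does not say how that proximity is computed. As a rough illustration, the sketch below scores each class by the mean pairwise cosine similarity of its sentence vectors. The choice of cosine similarity, the function names, and the toy two-dimensional vectors are all assumptions made for this example, not the authors' method.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_homogeneity(embeddings, labels):
    """Mean pairwise cosine similarity per class (hypothetical measure).

    Classes with fewer than two sentences are skipped, because no pair
    exists to compare.
    """
    by_class = {}
    for vec, label in zip(embeddings, labels):
        by_class.setdefault(label, []).append(vec)

    scores = {}
    for label, vecs in by_class.items():
        pairs = list(combinations(vecs, 2))
        if not pairs:
            continue
        total = sum(cosine_similarity(u, v) for u, v in pairs)
        scores[label] = total / len(pairs)
    return scores


# Toy vectors standing in for sentence embeddings (assumed, not real data).
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labels = ["tight", "tight", "loose", "loose"]
scores = class_homogeneity(embeddings, labels)
```

In this toy input the two "tight" sentences share the same vector and the two "loose" sentences are orthogonal, so the "tight" class scores higher. With real data, the vectors would come from whichever sentence encoder is used for the classifier.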

research
12/19/2022

E-NER – An Annotated Named Entity Recognition Corpus of Legal Text

Identifying named entities such as a person, location or organization, i...
research
01/30/2019

The Wilderness Area Data Set: Adapting the Covertype data set for unsupervised learning

Benchmark data sets are of vital importance in machine learning research...
research
11/27/2017

Classifier Selection with Permutation Tests

This work presents a content-based recommender system for machine learni...
research
06/12/2023

Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence

Better understanding of Large Language Models' (LLMs) legal analysis abi...
research
12/05/2012

Making Early Predictions of the Accuracy of Machine Learning Applications

The accuracy of machine learning systems is a widely studied research to...
research
12/14/2017

Passing the Brazilian OAB Exam: data preparation and some experiments

In Brazil, all legal professionals must demonstrate their knowledge of t...
research
09/18/2023

Concurrent Haptic, Audio, and Visual Data Set During Bare Finger Interaction with Textured Surfaces

Perceptual processes are frequently multi-modal. This is the case of hap...
