Datasets for Portuguese Legal Semantic Textual Similarity: Comparing weak supervision and an annotation process approaches

05/29/2023
by Daniel da Silva Junior, et al.
The Brazilian judiciary has a large workload, which results in long delays before legal proceedings are concluded. In Resolution 469/2022, the Brazilian National Council of Justice established formal guidance for document and process digitalization, opening up the possibility of using automatic techniques to help with everyday tasks in the legal field, particularly given the large number of texts produced in routine legal procedures. Notably, Artificial Intelligence (AI) techniques can process textual data and extract useful information from it, potentially speeding up these proceedings. However, the legal-domain datasets required by many AI techniques are scarce and difficult to obtain, because they need labels from experts. To address this challenge, this article contributes four datasets from the legal domain: two containing documents and metadata but no labels, and two labeled with a heuristic intended for use in semantic textual similarity tasks. In addition, to evaluate the effectiveness of the proposed heuristic labeling process, this article presents a small ground truth dataset generated from domain expert annotations. The analysis of the ground truth labels highlights that semantic analysis of domain text can be challenging even for domain experts. The comparison between ground truth and heuristic labels also shows that the heuristic labels are useful.
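
As an illustration of how heuristic labels might be compared against an expert ground truth, the following is a minimal sketch and not the authors' evaluation procedure. It assumes that each document pair carries one heuristic similarity score and several expert scores on a shared numeric scale (a 0 to 5 scale is assumed here), and it uses Pearson correlation as one plausible agreement measure. The data, the scale, and the names `pearson`, `expert_annotations`, and `heuristic_scores` are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one row per document pair, one column per expert.
# The 0-5 similarity scale is an assumption, not taken from the article.
expert_annotations = [
    [4, 5, 4],
    [1, 0, 1],
    [3, 2, 3],
    [5, 5, 4],
    [0, 1, 0],
]

# Hypothetical heuristic (weak supervision) score for the same pairs.
heuristic_scores = [4, 1, 2, 5, 0]

# Average the expert scores per pair to form a single ground truth value.
expert_mean = [sum(row) / len(row) for row in expert_annotations]

r = pearson(heuristic_scores, expert_mean)
print(f"Pearson correlation between heuristic and expert labels: {r:.3f}")
```

Averaging the expert scores hides disagreement between annotators, which the abstract notes can be substantial; an inter-annotator agreement measure computed on the raw annotations would complement this comparison.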
