Brazilian Court Documents Clustered by Similarity Together Using Natural Language Processing Approaches with Transformers

Recent advances in Artificial Intelligence (AI) have produced promising results on complex problems in Natural Language Processing (NLP), making it an important tool for speeding up the resolution of judicial proceedings in the legal area. In this context, this work targets the problem of detecting the degree of similarity that can be achieved between judicial documents within the inferred groups, by applying six transformer-based NLP techniques: BERT, GPT-2 and RoBERTa models pre-trained on the Brazilian Portuguese language, and the same three models further specialized using 210,000 legal proceedings. Documents were pre-processed and their content was transformed into a vector representation using these NLP techniques. Unsupervised learning was used to cluster the lawsuits, and the quality of each model was calculated from the cosine distance between the elements of a group and its centroid. We found that transformer-based models perform better than those in previous research, with the RoBERTa model specialized for the Brazilian Portuguese language standing out, making it possible to advance the current state of the art in NLP applied to the legal sector.
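
The quality measure described above can be illustrated with a minimal sketch. It assumes that document embeddings and cluster labels are already available, and that "quality" is read as the mean cosine distance of each cluster's members to that cluster's centroid; the abstract does not give the exact formula, so this is one plausible interpretation. The function names (`cosine_distance`, `centroid`, `cluster_cohesion`) and the toy two-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Return 1 - cos(u, v); 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_cohesion(embeddings, labels):
    """Map each cluster label to the mean cosine distance of its
    members to the cluster centroid (lower means a tighter cluster)."""
    groups = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        groups[label].append(vector)
    scores = {}
    for label, members in groups.items():
        center = centroid(members)
        total = sum(cosine_distance(m, center) for m in members)
        scores[label] = total / len(members)
    return scores


# Toy stand-ins for document embeddings (assumption: real embeddings
# would come from a transformer model and have hundreds of dimensions).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
scores = cluster_cohesion(embeddings, labels)
```

Under this reading, a lower score means the documents in a group lie closer to their centroid, so embeddings from different models can be compared by clustering each set and comparing the resulting scores.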

research
09/13/2022

Pre-training Transformers on Indian Legal Text

Natural Language Processing in the legal domain been benefited hugely by...
research
11/01/2019

Finding the most similar textual documents using Case-Based Reasoning

In recent years, huge amounts of unstructured textual data on the Intern...
research
05/21/2021

Towards Automatic Comparison of Data Privacy Documents: A Preliminary Experiment on GDPR-like Laws

General Data Protection Regulation (GDPR) becomes a standard law for dat...
research
06/15/2023

Mapping Researcher Activity based on Publication Data by means of Transformers

Modern performance on several natural language processing (NLP) tasks ha...
research
12/12/2022

Drivers of the decrease of patent similarities from 1976 to 2021

The citation network of patents citing prior art arises from the legal o...
research
04/24/2023

ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain

Publicly available information contains valuable information for Cyber T...
research
11/20/2018

An empirical evaluation of AMR parsing for legal documents

Many approaches have been proposed to tackle the problem of Abstract Mea...
