A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts

04/06/2022
by Miriam Schirmer, et al.

Recent progress in natural language processing has been impressive across many areas, with transformer-based approaches setting new benchmarks for a wide range of applications. This development has also lowered the barriers for people outside the NLP community to tap into tools and resources that can be applied to a variety of domain-specific applications. The bottleneck, however, remains the lack of annotated gold-standard collections as soon as one's research or professional interest falls outside the scope of what is readily available. One such area is genocide-related research, which also includes the work of experts with a professional interest in accessing, exploring, and searching large-scale document collections on the topic, such as lawyers. We present GTC (Genocide Transcript Corpus), the first annotated corpus of genocide-related court transcripts, which serves three purposes: (1) to provide a first reference corpus for the community, (2) to establish benchmark performance, using state-of-the-art transformer-based approaches, for the new classification task of identifying paragraphs that contain violence-related witness statements, and (3) to explore first steps towards transfer learning within the domain. We consider our contribution to address, in particular, this year's hot topic of Language Technology for All.

research
04/19/2017

A Large Self-Annotated Corpus for Sarcasm

We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for...
research
01/09/2022

Indian Language Wordnets and their Linkages with Princeton WordNet

Wordnets are rich lexico-semantic resources. Linked wordnets are extensi...
research
01/27/2020

SemClinBr – a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks

The high volume of research focusing on extracting patient's information...
research
11/21/2022

Deanthropomorphising NLP: Can a Language Model Be Conscious?

This work is intended as a voice in the discussion over the recent claim...
research
03/24/2023

MUG: A General Meeting Understanding and Generation Benchmark

Listening to long video/audio recordings from video conferencing and onl...
research
08/30/2023

Benchmarking Multilabel Topic Classification in the Kyrgyz Language

Kyrgyz is a very underrepresented language in terms of modern natural la...
research
03/24/2023

Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) ...
