ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification

03/07/2023
by Taja Kuzman, et al.

ChatGPT has shown strong capabilities in natural language generation tasks, which naturally leads researchers to explore where its abilities end. In this paper, we examine whether ChatGPT can be used for zero-shot text classification, more specifically, automatic genre identification. We compare ChatGPT with a multilingual XLM-RoBERTa language model that was fine-tuned on datasets manually annotated with genres. The models are compared on test sets in two languages: English and Slovenian. Results show that ChatGPT outperforms the fine-tuned model when applied to a dataset that neither model had seen before. Even when applied to Slovenian, an under-resourced language, ChatGPT performs no worse than on English. However, if the model is prompted entirely in Slovenian, performance drops significantly, showing the current limitations of using ChatGPT for smaller languages. The presented results lead us to question whether this is the beginning of the end of laborious manual annotation campaigns, even for smaller languages such as Slovenian.

