Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate

08/19/2023
by Michela Lorandi, et al.

LLMs like GPT perform well on tasks involving English, which dominates their training data. In this paper, we look at how they cope with tasks involving languages that are severely under-represented in their training data, in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase, we tested a range of prompt types and formats on GPT-3.5 and GPT-4 with a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced language, and (ii) generation into English followed by translation into the under-resourced language. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task, where they outperformed competitor systems by substantial margins in all languages on all metrics. We conclude that good performance on under-resourced languages can be achieved out of the box with state-of-the-art LLMs. However, our best results (for Welsh) remain well below those of the lowest-ranked English system at WebNLG'20.
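The two scenarios described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the triple-style input, the prompt wording, and the function names are all hypothetical, and the `generate` and `translate` callables stand in for whatever LLM and machine-translation services are actually used.

```python
from typing import Callable, List, Tuple

# Assumption: inputs are (subject, predicate, object) triples.
Triple = Tuple[str, str, str]
Example = Tuple[List[Triple], str]  # (input triples, reference text)


def format_triples(triples: List[Triple]) -> str:
    """Linearise triples into a single string (hypothetical format)."""
    return "; ".join(f"({s} | {p} | {o})" for s, p, o in triples)


def build_few_shot_prompt(
    examples: List[Example], triples: List[Triple], language: str
) -> str:
    """Build a few-shot prompt; the wording is illustrative only."""
    lines = [f"Write the following triples as fluent {language} text."]
    for ex_triples, ex_text in examples:
        lines.append(f"Triples: {format_triples(ex_triples)}")
        lines.append(f"Text: {ex_text}")
    lines.append(f"Triples: {format_triples(triples)}")
    lines.append("Text:")
    return "\n".join(lines)


def generate_direct(
    generate: Callable[[str], str],
    examples: List[Example],
    triples: List[Triple],
    language: str,
) -> str:
    """Scenario (i): generate directly in the under-resourced language."""
    return generate(build_few_shot_prompt(examples, triples, language))


def generate_via_english(
    generate: Callable[[str], str],
    translate: Callable[[str, str], str],
    examples: List[Example],
    triples: List[Triple],
    language: str,
) -> str:
    """Scenario (ii): generate English text, then translate it."""
    english = generate(build_few_shot_prompt(examples, triples, "English"))
    return translate(english, language)
```

Passing the model and translator in as plain callables keeps the sketch independent of any particular API; in scenario (ii) the few-shot examples would be English input/output pairs, while in scenario (i) they would be pairs in the target language.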

research
04/05/2020

Machine Translation Pre-training for Data-to-Text Generation – A Case Study in Czech

While there is a large body of research studying deep learning methods f...
research
03/18/2020

Unsupervised Pidgin Text Generation By Pivoting English Data and Self-Training

West African Pidgin English is a language that is significantly spoken i...
research
08/13/2021

MTG: A Benchmarking Suite for Multilingual Text Generation

We introduce MTG, a new benchmark suite for training and evaluating mult...
research
04/09/2019

Bilingual-GAN: A Step Towards Parallel Text Generation

Latent space based GAN methods and attention based sequence to sequence ...
research
02/19/2021

Multilingual Augmenter: The Model Chooses

Natural Language Processing (NLP) relies heavily on training data. Trans...
research
01/22/2020

Normalization of Input-output Shared Embeddings in Text Generation Models

Neural Network based models have been state-of-the-art models for variou...
research
09/20/2023

Prototype of a robotic system to assist the learning process of English language with text-generation through DNN

In the last ongoing years, there has been a significant ascending on the...