Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training

05/22/2023
by Miriam Anschütz, et al.

Automatic text simplification systems help to reduce textual information barriers on the internet. However, for languages other than English, little parallel data exists to train these systems. We propose a two-step approach to overcome this data scarcity issue. First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German. Then, we used these models as decoders in a sequence-to-sequence simplification task. We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts. Moreover, with the style-specific pre-training, we reduced the number of trainable parameters in text simplification models. Hence, less parallel data suffices for training. Our results indicate that pre-training on unaligned data can reduce the required parallel data while improving performance on downstream tasks.
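To make the two-step setup concrete, below is a minimal sketch in Python using the Hugging Face transformers library. It is not the authors' exact implementation: the model checkpoints (dbmdz/german-gpt2, bert-base-german-cased), the output path easy-language-gpt2, and the choice to freeze everything in the decoder except its cross-attention layers are illustrative assumptions.

```python
# Illustrative sketch of the two-step approach (not the authors' code).
# Step 1: adapt a German causal LM to the Easy Language style on unaligned text.
# Step 2: reuse the adapted LM as the decoder of a seq2seq simplification model,
#         freezing most of it so little parallel data is needed for training.
from transformers import AutoModelForCausalLM, AutoTokenizer, EncoderDecoderModel

# --- Step 1: style-specific pre-training (fine-tuning on unaligned data) ---
lm_name = "dbmdz/german-gpt2"  # stand-in for any German causal language model
tokenizer = AutoTokenizer.from_pretrained(lm_name)
style_lm = AutoModelForCausalLM.from_pretrained(lm_name)
# ... continue training style_lm with the usual causal-LM objective
#     on a corpus of German Easy Language (training loop omitted) ...
style_lm.save_pretrained("easy-language-gpt2")  # hypothetical output path
tokenizer.save_pretrained("easy-language-gpt2")

# --- Step 2: plug the style-adapted LM in as a seq2seq decoder ---
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-german-cased",  # illustrative encoder for standard German input
    "easy-language-gpt2",      # decoder initialized from the style-adapted LM
)

# Freeze the decoder's pre-trained weights; only the randomly initialized
# cross-attention layers (and the encoder) stay trainable, which shrinks
# the number of parameters the scarce parallel data has to fit.
for name, param in model.decoder.named_parameters():
    if "crossattention" not in name and "ln_cross_attn" not in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```

How much to freeze is a trade-off: with more parallel data available, unfreezing parts of the decoder may help, while freezing more preserves the Easy Language style learned in step 1.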
