Klexikon: A German Dataset for Joint Summarization and Simplification

01/18/2022
by Dennis Aumiller, et al.

Traditionally, Text Simplification is treated as a monolingual translation task in which sentences from source texts are aligned with their simplified counterparts for training. However, especially for longer input documents, summarizing the text (or dropping less relevant content altogether) plays an important role in the simplification process, and this is currently not reflected in existing datasets. At the same time, resources for non-English languages are generally scarce, which makes training new solutions prohibitively difficult. To tackle this problem, we pose core requirements for a system that can jointly summarize and simplify long source documents. We further describe the creation of a new dataset for joint Text Simplification and Summarization based on German Wikipedia and the German children's lexicon "Klexikon", consisting of almost 2900 documents. We release a document-aligned version that particularly highlights the summarization aspect, and provide statistical evidence that this resource is well suited to simplification as well. Code and data are available on GitHub: https://github.com/dennlinger/klexikon
