A Survey of Recent Abstract Summarization Techniques

04/15/2021
by Diyah Puspitaningrum, et al.

This paper surveys several recent abstractive summarization methods: T5, Pegasus, and ProphetNet. We implement the systems in two languages, English and Indonesian. We investigate the impact of pre-trained models (one T5, three Pegasus variants, three ProphetNet variants) on several English and Indonesian Wikipedia datasets and compare the results to the Wikipedia systems' summaries. T5-Large, Pegasus-XSum, and ProphetNet-CNNDM provide the best summarization. The most significant factors influencing ROUGE performance are coverage, density, and compression: the higher these scores, the better the summary. Other factors that influence the ROUGE scores are the pre-training objective, the characteristics of the dataset, the dataset used to test the pre-trained model, and the cross-lingual function. Several suggestions to address this paper's limitations are: 1) ensure that the dataset used for pre-training is sufficiently large and contains adequate instances for cross-lingual purposes; 2) keep advanced processes (fine-tuning) reasonable. We recommend using a large dataset with comprehensive coverage of topics in many languages before applying advanced processes, such as the train-infer-train procedure for zero-shot translation, in the training stage of the pre-trained model.
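The coverage, density, and compression statistics mentioned above can be computed directly from an article and its summary. Below is a minimal, self-contained sketch (not the authors' code) following the standard extractive-fragment definitions of these metrics; the tokenization by whitespace is a simplifying assumption for illustration.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily match the longest shared token fragments of the summary
    against the article, left to right through the summary."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = []
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                if k > len(best):
                    best = summary_tokens[i:i + k]
                j += max(k, 1)
            else:
                j += 1
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1  # summary token never appears in the article
    return fragments

def coverage(article_tokens, summary_tokens):
    # Fraction of summary tokens that lie inside extracted fragments.
    frags = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) for f in frags) / len(summary_tokens)

def density(article_tokens, summary_tokens):
    # Average squared fragment length per summary token; higher values
    # mean the summary copies longer contiguous spans.
    frags = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) ** 2 for f in frags) / len(summary_tokens)

def compression(article_tokens, summary_tokens):
    # Ratio of article length to summary length.
    return len(article_tokens) / len(summary_tokens)
```

For example, with `article = "the quick brown fox jumps over the lazy dog".split()` and `summary = "the quick fox jumps".split()`, the fragments are `["the", "quick"]` and `["fox", "jumps"]`, giving coverage 1.0, density 2.0, and compression 2.25.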
