FreSaDa: A French Satire Data Set for Cross-Domain Satire Detection

04/10/2021
by Radu Tudor Ionescu, et al.

In this paper, we introduce FreSaDa, a French Satire Data Set composed of 11,570 articles from the news domain. To avoid reporting unreasonably high accuracy rates that stem from learning characteristics specific to publication sources, we divided our samples into training, validation and test sets such that the training publication sources are distinct from the validation and test publication sources. This gives rise to a cross-domain (cross-source) satire detection task. We employ two classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (average of CamemBERT word embeddings). As an additional contribution, we present an unsupervised domain adaptation method that treats the pairwise similarities (given by the dot product) between the training samples and the validation samples as features. By including these domain-specific features, we attain significant improvements for both character n-grams and CamemBERT embeddings.
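The domain adaptation idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the L2 normalization of the rows, the random stand-in feature vectors, and the choice to append the similarities to the original features (rather than replace them) are all assumptions made for illustration. The only element taken from the abstract is that dot products between samples and the unlabeled validation samples serve as extra features.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    # Assumption: rows are scaled to unit length so dot products are comparable.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def add_target_similarity_features(X, X_target):
    # Hypothetical helper: dot product of every sample in X with every
    # unlabeled target-domain sample, appended as extra feature columns.
    sims = X @ X_target.T  # shape: (n_samples, n_target)
    return np.hstack([X, sims])

# Random stand-ins for real feature vectors (e.g. n-gram or embedding features).
rng = np.random.default_rng(0)
X_train = l2_normalize(rng.normal(size=(6, 4)))  # source-domain samples
X_val = l2_normalize(rng.normal(size=(3, 4)))    # target-domain samples, labels unused

# Both sets are described by their similarity to the same target samples,
# so a classifier trained on X_train_aug can be applied to X_val_aug.
X_train_aug = add_target_similarity_features(X_train, X_val)
X_val_aug = add_target_similarity_features(X_val, X_val)
```

No validation labels are used at any point, which is what makes the adaptation unsupervised. Any classifier can then be fitted on `X_train_aug` and the training labels.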


