MIND - Mainstream and Independent News Documents Corpus

08/13/2021
by Danielle Caled, et al.

This paper presents and characterizes MIND, a new Portuguese corpus comprising different types of articles collected from online mainstream and alternative media sources over a 10-month period. The articles in the corpus are organized into five collections: facts, opinions, entertainment, satires, and conspiracy theories. We explain how the data collection process was conducted and present a set of linguistic metrics that support a preliminary characterization of the texts in the corpus. We also analyze the most frequent topics in the corpus and discuss the main differences and similarities among the collections. Finally, we enumerate tasks and applications that could benefit from this corpus, in particular those directly or indirectly related to misinformation detection. Overall, the corpus and our initial analysis are designed to support future exploratory news studies and to provide better insight into misinformation.
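As a rough illustration of what a per-collection linguistic characterization might look like, the sketch below computes three simple surface metrics (token count, type-token ratio, and mean sentence length) and averages them by collection. The choice of metrics, the function names, the regex-based tokenizer, and the toy input sentences are all assumptions made for illustration; the abstract does not specify which metrics the authors use or how the corpus is stored.

```python
import re
from collections import defaultdict

# The five collection names come from the abstract; everything else here
# (metrics, helpers, data layout) is a hypothetical illustration.
COLLECTIONS = ("facts", "opinions", "entertainment", "satires", "conspiracy theories")


def tokenize(text):
    """Lowercase word tokens; \\w is Unicode-aware, so accented letters are kept."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive sentence split on ., ! and ?; a real pipeline would use a proper segmenter."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def text_metrics(text):
    """Return simple surface metrics for one article."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    return {
        "n_tokens": len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }


def metrics_by_collection(articles):
    """Average each metric over the articles of every collection.

    `articles` is an iterable of (collection_name, text) pairs (assumed layout).
    """
    grouped = defaultdict(list)
    for collection, text in articles:
        grouped[collection].append(text_metrics(text))
    return {
        collection: {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}
        for collection, rows in grouped.items()
    }


if __name__ == "__main__":
    # Toy sentences written for this example, not taken from the corpus.
    sample = [
        ("facts", "O governo anunciou novas medidas. As medidas entram em vigor amanhã."),
        ("satires", "O gato dorme. O gato come!"),
    ]
    for name, metrics in metrics_by_collection(sample).items():
        print(name, metrics)
```

Averaging per article before comparing collections keeps long articles from dominating a collection's profile. Pooling all tokens per collection is an equally reasonable alternative, depending on the question being asked.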

