Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information

This paper presents the first publicly available version of the Carolina Corpus and discusses its future directions. Carolina is a large open corpus of Brazilian Portuguese texts, under construction using a web-as-corpus methodology enhanced with provenance, typology, versioning, and text integrality. The corpus is intended to serve both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models, helping to remove Portuguese from the set of low-resource languages. Here we present the methodology used to build the corpus, comparing it with other existing methodologies, as well as the corpus's current state: Carolina's first public version has 653,322,577 tokens, distributed over 7 broad types. Each text is annotated with several different metadata categories in its header, which we developed using TEI annotation standards. We also present ongoing derivative works and invite NLP researchers to contribute their own.

