Compiling and Processing Historical and Contemporary Portuguese Corpora

10/02/2017
by Marcos Zampieri, et al.

This technical report describes the framework used to process three large Portuguese corpora. Two of them contain newspaper texts, one published in Brazil and the other in Portugal. The third is Colonia, a historical Portuguese collection of texts written between the 16th and the early 20th century. The report covers pre-processing, segmentation, and annotation of the corpora, as well as indexing and querying methods. Finally, it lists published research papers that use the corpora.

research
05/24/2022

Charon: a FrameNet Annotation Tool for Multimodal Corpora

This paper presents Charon, a web tool for annotating multimodal corpora...
research
12/14/2014

Tools for Terminology Processing

Automatic terminology processing appeared 10 years ago when electronic c...
research
06/13/2023

Curatr: A Platform for Semantic Analysis and Curation of Historical Literary Texts

The increasing availability of digital collections of historical and con...
research
09/05/2019

Common Library 1.0: A Corpus of Victorian Novels Reflecting the Population in Terms of Publication Year and Author Gender

Research in 19th-century book history, sociology of literature, and quan...
research
01/06/2020

Identifying Historical Travelogues in Large Text Corpora Using Machine Learning

Travelogues represent an important and intensively studied source for sc...
research
10/28/2020

Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript

This paper outlines the creation of three corpora for multilingual compa...
research
09/30/2016

Modeling Language Change in Historical Corpora: The Case of Portuguese

This paper presents a number of experiments to model changes in a histor...
