AMALGUM – A Free, Balanced, Multilayer English Web Corpus

06/18/2020
by Luke Gessler, et al.

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources, the corpus is meant to offer a larger alternative to small, manually annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers to achieve a "better than NLP" benchmark, and we evaluate the accuracy of the resulting resource.


