Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis

12/15/2021
by Seth Kulick, et al.

We present the first parsing results on the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.9 million word treebank that is an important resource for research in syntactic change. We describe key features of PPCEME that make it challenging for parsing, including a larger and more varied set of function tags than in the Penn Treebank. We present results for this corpus using a modified version of the Berkeley Neural Parser and the approach to function tag recovery of Gabbard et al. (2006). Despite its simplicity, this approach works surprisingly well, suggesting that the original structure can be recovered with sufficient accuracy to support linguistic applications (e.g., searching for syntactic structures of interest). However, a subset of function tags (e.g., the tag indicating direct speech) needs additional work, and we discuss some further limits of this approach. The resulting parser will be used to parse Early English Books Online, a 1.1 billion word corpus whose utility for the study of syntactic change will be greatly increased by the addition of accurate parse trees.
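To make the notion of function tags concrete, the sketch below reads a Penn-style bracketed tree and separates each node label into its syntactic category and its function tags, then searches for nodes carrying a given tag. This is only an illustration of the kind of structure search the abstract mentions, not the paper's parser or its tag-recovery method. The example sentence and the helper names (`tokenize`, `parse`, `split_label`, `find_with_tag`) are hypothetical, and the labels shown (`IP-MAT-SPE`, `NP-SBJ`, `NP-OB1`) are assumed to follow the corpus's hyphenated label convention, with `SPE` as the direct-speech tag.

```python
import re

def tokenize(text):
    """Split a bracketed tree string into '(', ')' and symbol tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens):
    """Parse one bracketed tree into nested (label, children) tuples.
    Leaves (words) are kept as plain strings."""
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()

def split_label(label):
    """Split 'NP-SBJ-1' into ('NP', ['SBJ']).
    Purely numeric parts are treated as co-indexing, not function tags
    (an assumption of this sketch)."""
    parts = label.split("-")
    tags = [p for p in parts[1:] if not p.isdigit()]
    return parts[0], tags

def find_with_tag(tree, tag):
    """Return the labels of all nodes that carry the given function tag."""
    label, children = tree
    _, tags = split_label(label)
    hits = [label] if tag in tags else []
    for child in children:
        if isinstance(child, tuple):
            hits.extend(find_with_tag(child, tag))
    return hits

# Hypothetical example tree, not taken from the corpus.
example = "(IP-MAT-SPE (NP-SBJ (PRO I)) (VBP know) (NP-OB1 (PRO it)))"
tree = parse(tokenize(example))
subjects = find_with_tag(tree, "SBJ")
direct_speech = find_with_tag(tree, "SPE")
```

A parser that outputs only bare categories (`IP`, `NP`) would make searches like `find_with_tag(tree, "SPE")` impossible, which is why recovering function tags matters for linguistic search.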


research
02/24/2020

Parsing Early Modern English for Linguistic Search

We investigate the question of whether advances in NLP over the last few...
research
10/01/2019

Specializing Word Embeddings (for Parsing) by Information Bottleneck

Pre-trained word embeddings like ELMo and BERT contain rich syntactic an...
research
06/01/2016

Improved Parsing for Argument-Clusters Coordination

Syntactic parsers perform poorly in prediction of Argument-Cluster Coord...
research
04/06/2019

Speeding Up Natural Language Parsing by Reusing Partial Results

This paper proposes a novel technique that applies case-based reasoning ...
research
04/17/2019

Neural Constituency Parsing of Speech Transcripts

This paper studies the performance of a neural self-attentive parser on ...
research
05/26/2021

Prosodic segmentation for parsing spoken dialogue

Parsing spoken dialogue poses unique difficulties, including disfluencie...
research
06/05/2022

Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish

In stylometric investigations, frequencies of the most frequent words (M...
