Part-of-Speech Tagging for Historical English

03/10/2016
by Yi Yang, et al.

As more historical texts are digitized, there is interest in applying natural language processing tools to these archives. However, the performance of these tools is often unsatisfactory, due to language change and genre differences. Spelling normalization heuristics are the dominant solution for dealing with historical texts, but this approach fails to account for changes in usage and vocabulary. In this empirical paper, we assess the capability of domain adaptation techniques to cope with historical texts, focusing on the classic benchmark task of part-of-speech tagging. We evaluate several domain adaptation methods on the task of tagging Early Modern English and Modern British English texts in the Penn Corpora of Historical English. We demonstrate that the Feature Embedding method for unsupervised domain adaptation outperforms word embeddings and Brown clusters, showing the importance of embedding the entire feature space, rather than just individual words. Feature Embeddings also give better performance than spelling normalization, but the combination of the two methods is better still, yielding a 5% raw improvement in tagging accuracy on Early Modern English texts.

research
12/14/2014

Unsupervised Domain Adaptation with Feature Embeddings

Representation learning is the dominant technique for unsupervised domai...
research
02/24/2020

Parsing Early Modern English for Linguistic Search

We investigate the question of whether advances in NLP over the last few...
research
04/04/2019

Unsupervised Domain Adaptation of Contextualized Embeddings: A Case Study in Early Modern English

Contextualized word embeddings such as ELMo and BERT provide a foundatio...
research
04/04/2019

Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling

Contextualized word embeddings such as ELMo and BERT provide a foundatio...
research
10/25/2016

Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

Natural-language processing of historical documents is complicated by th...
research
09/30/2016

Modeling Language Change in Historical Corpora: The Case of Portuguese

This paper presents a number of experiments to model changes in a histor...
research
07/29/2019

One-to-X analogical reasoning on word embeddings: a case for diachronic armed conflict prediction from news texts

We extend the well-known word analogy task to a one-to-X formulation, in...
