What to do about non-standard (or non-canonical) language in NLP

08/28/2016
by Barbara Plank, et al.

Real-world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technologies to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, texts can differ from the standard along many dimensions, e.g., socio-demographics, language, genre, and sentence type. The solution is not obvious: we cannot control for all factors, and it is not clear how best to go beyond the current practice of training on homogeneous data from a single domain and language. In this paper, I review the notion of canonicity and how it shapes our community's approach to language. I argue for leveraging what I call fortuitous data, i.e., non-obvious data that is hitherto neglected, hidden in plain sight, or raw data that needs to be refined. If we embrace the variety of this heterogeneous data by combining it with proper algorithms, we will not only produce more robust models, but will also enable adaptive language technology capable of addressing natural language variation.

