Character Eyes: Seeing Language through Character-Level Taggers

03/12/2019
by Yuval Pinter, et al.

Character-level models have been used extensively in recent years in NLP tasks, both as supplements to and as replacements for closed-vocabulary token-level word representations. In one popular architecture, character-level LSTMs are used to feed token representations into a sequence tagger predicting token-level annotations such as part-of-speech (POS) tags. In this work, we examine the behavior of POS taggers across languages from the perspective of individual hidden units within the character LSTM. We aggregate the behavior of these units into language-level metrics which quantify the challenges that taggers face on languages with different morphological properties, and identify links between a language's synthesis and affixation preference and the emergent behavior of the hidden tagger layer. In a comparative experiment, we show how modifying the balance between forward and backward hidden units affects model arrangement and performance in these types of languages.
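To make the architecture described in the abstract more concrete, the following is a minimal sketch of how a bidirectional character-level encoder can build a token representation with an uneven split between forward and backward hidden units. It is not the authors' implementation: a plain tanh recurrent cell stands in for the LSTM, the weights are random rather than trained, and the names `make_rnn`, `run_rnn`, `encode_token`, and the 6/2 unit split are all hypothetical choices made for illustration.

```python
import math
import random


def make_rnn(input_dim, hidden_dim, rng):
    """Random weights for a simple tanh recurrent cell (a stand-in for an LSTM)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W_in": mat(hidden_dim, input_dim), "W_h": mat(hidden_dim, hidden_dim)}


def run_rnn(cell, inputs):
    """Run the cell over a sequence of input vectors and return the final hidden state."""
    hidden = [0.0] * len(cell["W_in"])
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in_row, x))
                + sum(w * hi for w, hi in zip(w_h_row, hidden))
            )
            for w_in_row, w_h_row in zip(cell["W_in"], cell["W_h"])
        ]
    return hidden


def one_hot(char, alphabet):
    """One-hot vector for a character over a fixed alphabet."""
    return [1.0 if char == c else 0.0 for c in alphabet]


def encode_token(token, alphabet, fwd_cell, bwd_cell):
    """Concatenate the final forward state and the final backward state."""
    chars = [one_hot(c, alphabet) for c in token]
    forward = run_rnn(fwd_cell, chars)         # reads the token left to right
    backward = run_rnn(bwd_cell, chars[::-1])  # reads the token right to left
    return forward + backward


rng = random.Random(0)
alphabet = sorted(set("walked"))
n_forward, n_backward = 6, 2  # hypothetical asymmetric split of 8 hidden units
fwd_cell = make_rnn(len(alphabet), n_forward, rng)
bwd_cell = make_rnn(len(alphabet), n_backward, rng)

vector = encode_token("walked", alphabet, fwd_cell, bwd_cell)
print(len(vector))  # 8: six forward units followed by two backward units
```

In a full tagger, a vector like this would be passed for each token to a sequence-level model that predicts POS tags. Changing `n_forward` and `n_backward` while keeping their sum fixed is one simple way to picture the forward/backward balance that the comparative experiment varies.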

research
08/01/2021

Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

Commonly-used transformer language models depend on a tokenization schem...
research
12/10/2016

A Character-Word Compositional Neural Language Model for Finnish

Inspired by recent research, we explore ways to model the highly morphol...
research
08/13/2018

Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging

Character-level models of tokens have been shown to be effective at deal...
research
04/26/2017

From Characters to Words to in Between: Do We Capture Morphology?

Words can be represented by composing the representations of subword uni...
research
07/20/2017

A Sub-Character Architecture for Korean Language Processing

We introduce a novel sub-character architecture that exploits a unique c...
research
05/09/2023

What is the best recipe for character-level encoder-only modelling?

This paper aims to benchmark recent progress in language understanding m...
research
04/01/2021

Canonical and Surface Morphological Segmentation for Nguni Languages

Morphological Segmentation involves decomposing words into morphemes, th...
