Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging

08/13/2018
by Apostolos Kemos, et al.

Character-level models of tokens have been shown to be effective at handling within-token noise and out-of-vocabulary words. However, these models still rely on correct token boundaries. In this paper, we propose a novel end-to-end character-level model and demonstrate its effectiveness in multilingual settings and when token boundaries are noisy. Our model is a semi-Markov conditional random field with neural networks for character and segment representation; it requires no tokenizer. The model matches state-of-the-art baselines on several languages and significantly outperforms them on a noisy English version of a part-of-speech tagging benchmark.
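To make the modeling concrete, below is a minimal sketch of the semi-Markov CRF forward pass such a model rests on: a dynamic program that sums over all joint segmentations and taggings of a raw character sequence, which is what removes the need for a tokenizer. In the paper the segment scores come from neural character and segment representations; in this sketch they are random placeholders, and all function names, shapes, and sizes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

NEG_INF = -1e30

def logsumexp(xs):
    """Numerically stable log-sum-exp over a 1-D array or list."""
    xs = np.asarray(xs, dtype=float)
    m = xs.max()
    if m <= NEG_INF:
        return NEG_INF
    return m + np.log(np.exp(xs - m).sum())

def semi_crf_log_partition(seg_scores, trans, init, T, max_len):
    """Forward algorithm over all segmentations of T characters.

    seg_scores[s, t, y] scores characters s..t-1 as one segment with
    tag y; in the paper this would come from neural character/segment
    representations, here it is just a precomputed array.
    trans[y_prev, y] is a tag-transition score, init[y] a start score.
    """
    K = init.shape[0]
    # alpha[t, y]: log-sum of scores of all analyses of chars 0..t-1
    # whose last segment carries tag y.
    alpha = np.full((T + 1, K), NEG_INF)
    for t in range(1, T + 1):
        for y in range(K):
            terms = []
            for l in range(1, min(max_len, t) + 1):
                s = t - l  # candidate segment covers chars s..t-1
                emit = seg_scores[s, t, y]
                if s == 0:
                    terms.append(init[y] + emit)
                else:
                    terms.append(logsumexp(alpha[s] + trans[:, y]) + emit)
            alpha[t, y] = logsumexp(terms)
    return logsumexp(alpha[T])

# Toy usage with random placeholder scores: 6 characters, 3 POS tags,
# segments of up to 4 characters (all sizes are illustrative).
rng = np.random.default_rng(0)
T, K, max_len = 6, 3, 4
seg_scores = rng.normal(size=(T + 1, T + 1, K))
trans = rng.normal(size=(K, K))
init = rng.normal(size=K)
print(semi_crf_log_partition(seg_scores, trans, init, T, max_len))
```

Training such a model maximizes the score of the gold segmentation and tag sequence minus this log partition; decoding replaces the log-sum-exp with a max to recover the highest-scoring segmentation and tags.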

Related research:

- Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information (08/01/2021). Commonly-used transformer language models depend on a tokenization schem...

- Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings (05/21/2018). The rise of neural networks, and particularly recurrent neural networks,...

- Segmental Recurrent Neural Networks (11/18/2015). We introduce segmental recurrent neural networks (SRNNs) which define, g...

- Character Eyes: Seeing Language through Character-Level Taggers (03/12/2019). Character-level models have been used extensively in recent years in NLP...

- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization (06/23/2021). State-of-the-art models in natural language processing rely on separate ...

- Efficient Transformers with Dynamic Token Pooling (11/17/2022). Transformers achieve unrivalled performance in modelling language, but r...

- What is the best recipe for character-level encoder-only modelling? (05/09/2023). This paper aims to benchmark recent progress in language understanding m...
