Multilingual Language Processing From Bytes

12/01/2015
by Dan Gillick, et al.

We describe an LSTM-based model, which we call Byte-to-Span (BTS), that reads text as bytes and outputs span annotations of the form [start, length, label], where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than state-of-the-art Part-of-Speech tagging and Named Entity Recognition systems that use only the provided training datasets (no external data sources). Our models learn "from scratch" in that they do not rely on any elements of the standard Natural Language Processing pipeline (including tokenization), and thus can run in standalone fashion on raw text.
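To make the input and output representation concrete, here is a minimal Python sketch of reading text as UTF-8 bytes and describing entities as [start, length, label] triples. It is an illustration only, not the authors' implementation: the `Span` type, the helper names, the `S0`/`L6` token spellings, and the example sentence with its labels are all hypothetical, and the sketch assumes that start and length are measured in bytes.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical container for one [start, length, label] annotation."""
    start: int   # assumed to be a byte offset into the UTF-8 input
    length: int  # assumed to be a length in bytes
    label: str   # e.g. an entity type such as "PER" or "LOC"


def to_byte_ids(text: str) -> List[int]:
    """Encode text as a list of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def span_text(text: str, span: Span) -> str:
    """Recover the surface string covered by a byte-level span."""
    raw = text.encode("utf-8")
    return raw[span.start:span.start + span.length].decode("utf-8")


def span_to_tokens(span: Span) -> List[str]:
    """Spell out a span as three separate output symbols.

    The "S<n>" / "L<n>" spellings are invented for this sketch; the
    abstract only says that starts, lengths, and labels are separate
    vocabulary entries.
    """
    return [f"S{span.start}", f"L{span.length}", span.label]


text = "Óscar lives in München"
byte_ids = to_byte_ids(text)

# Hand-written example annotations (not model output).
spans = [Span(0, 6, "PER"), Span(16, 8, "LOC")]

for s in spans:
    print(span_to_tokens(s), "->", span_text(text, s))
```

Note that the sentence has 22 characters but 24 bytes, because "Ó" and "ü" each take two bytes in UTF-8. The input side never needs more than 256 distinct byte values regardless of language, which is the kind of small vocabulary the abstract refers to.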


