From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding

05/23/2023
by Li Sun, et al.

Current state-of-the-art models for natural language understanding require a preprocessing step that converts raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes. Such a fixed vocabulary limits a model's robustness to spelling errors and its capacity to adapt to new domains. In this work, we introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach: one at the word level and another at the sequence level. Concretely, we design an intra-word module that uses a shallow Transformer architecture to learn word representations from their characters, and a deep inter-word Transformer module that contextualizes each word representation by attending to the entire word sequence. Our model thus operates directly on character sequences with explicit awareness of word boundaries, but without a biased sub-word or word-level vocabulary. Experiments on various downstream tasks show that our method outperforms strong baselines. We also demonstrate that our hierarchical model is robust to textual corruption and domain shift.
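To make the two-level design concrete, the sketch below shows one way such a hierarchy could be wired up in PyTorch: a shallow character-level Transformer produces one vector per word, and a deeper Transformer then contextualizes those word vectors across the sequence. The class name, dimensions, layer counts, and the mean-pooling step are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class HierarchicalCharWordEncoder(nn.Module):
    """Sketch of a two-level open-vocabulary encoder:
    a shallow intra-word Transformer builds word vectors from characters,
    and a deep inter-word Transformer contextualizes them over the whole sequence."""

    def __init__(self, n_chars=256, d_model=512, n_heads=8,
                 intra_layers=2, inter_layers=12):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)
        intra = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.intra_word = nn.TransformerEncoder(intra, num_layers=intra_layers)
        inter = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.inter_word = nn.TransformerEncoder(inter, num_layers=inter_layers)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, chars_per_word), words already split on whitespace
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, c))   # embed characters of every word
        x = self.intra_word(x)                        # shallow intra-word encoding
        # Pool characters into a single vector per word (mean pooling is an
        # assumption here; the paper's pooling choice may differ).
        word_vecs = x.mean(dim=1).view(b, w, -1)
        return self.inter_word(word_vecs)             # deep inter-word contextualization


# Usage: two sentences of 10 words, each padded to 12 characters
model = HierarchicalCharWordEncoder()
out = model(torch.randint(0, 256, (2, 10, 12)))
print(out.shape)  # torch.Size([2, 10, 512])
```

Because the vocabulary is just the character (or byte) inventory, misspelled or out-of-domain words still receive a representation, which is the property the abstract highlights.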
