Lost in Space Marking

08/02/2022
by Cassandra L. Jacobs, et al.

We look at a decision taken early in training a subword tokenizer: whether the special space-marking symbol should attach to the word-initial token or to the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as on morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
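For concreteness, the two marking conventions can be contrasted with a minimal sketch (not the authors' exact setup) using the SentencePiece library's Unigram trainer; the corpus path and vocabulary size below are placeholders, and the printed segmentations are only illustrative.

```python
# Minimal sketch: word-initial vs. word-final space marking with a Unigram LM
# tokenizer trained via SentencePiece. "corpus.txt" and vocab_size are placeholders.
import sentencepiece as spm

CORPUS = "corpus.txt"  # hypothetical plain-text training file

# Word-initial marking (SentencePiece default): the meta-symbol attaches to
# the first piece of each word.
spm.SentencePieceTrainer.train(
    input=CORPUS,
    model_prefix="uni_initial",
    model_type="unigram",
    vocab_size=8000,
)

# Word-final marking: treat_whitespace_as_suffix moves the mark to word ends.
spm.SentencePieceTrainer.train(
    input=CORPUS,
    model_prefix="uni_final",
    model_type="unigram",
    vocab_size=8000,
    treat_whitespace_as_suffix=True,
)

sp_initial = spm.SentencePieceProcessor(model_file="uni_initial.model")
sp_final = spm.SentencePieceProcessor(model_file="uni_final.model")

text = "space marking matters"
print(sp_initial.encode(text, out_type=str))  # e.g. ['▁space', '▁marking', '▁matters']
print(sp_final.encode(text, out_type=str))    # e.g. ['space▁', 'marking▁', 'matters▁']
```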


