Incorporating Context into Subword Vocabularies

10/13/2022
by Shaked Yehezkel, et al.

Most popular subword tokenizers today are trained on word-frequency statistics over a corpus, with no information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in the highly contextualized settings of language models. We present SaGe, a tokenizer that tailors subwords to their downstream use by baking the contextualized signal into the vocabulary-creation phase. We show that SaGe keeps token contexts more cohesive than current widespread tokenizers do, without incurring a large price in encoding efficiency or domain robustness. SaGe improves performance on English GLUE classification tasks as well as on NER, and on inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.
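The core idea of scoring vocabulary items by a contextual signal rather than raw frequency can be illustrated with a toy sketch. The sketch below is an illustration under stated assumptions, not the paper's exact algorithm: it assumes a skipgram-style objective (sum of log-sigmoid dot products between a token's embedding and its neighbors' context embeddings) as the "contextual cohesion" score, and a one-shot greedy prune of the lowest-scoring tokens; the names `skipgram_score` and `prune_vocab` are hypothetical.

```python
import math
from collections import defaultdict

def skipgram_score(corpus, emb, ctx_emb, window=2):
    """Score each token by sum of log-sigmoid(e_tok . c_ctx) over
    (token, context) pairs within the window: higher means the token
    appears in more cohesive, predictable contexts."""
    scores = defaultdict(float)
    for sent in corpus:
        for i, tok in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                dot = sum(a * b for a, b in zip(emb[tok], ctx_emb[sent[j]]))
                scores[tok] += math.log(1.0 / (1.0 + math.exp(-dot)))
    return scores

def prune_vocab(vocab, corpus, emb, ctx_emb, target_size):
    """Greedily drop the tokens with the lowest contextual scores
    (tokens never seen in the corpus score -inf and go first)
    until the vocabulary fits the target size."""
    scores = skipgram_score(corpus, emb, ctx_emb)
    ranked = sorted(vocab, key=lambda t: scores.get(t, float("-inf")))
    drop = set(ranked[: max(0, len(vocab) - target_size)])
    return [t for t in vocab if t not in drop]
```

On a toy corpus, a token that never occurs (or occurs only in incoherent contexts) is pruned first, while frequent tokens with consistent neighborhoods survive; the actual method iterates this kind of scoring during vocabulary construction rather than applying it once.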

