Semantic Tokenizer for Enhanced Natural Language Processing

04/24/2023
by Sandeep Mehta, et al.

Traditionally, NLP performance gains have come from improving model architectures and increasing parameter counts, while vocabulary construction has focused on maximizing the number of words represented through subword regularization. We present a novel tokenizer that uses semantics to drive vocabulary construction. The tokenizer includes a trainer that uses stemming to enhance subword formation. Further optimizations and adaptations are implemented to minimize the number of words that cannot be encoded. The encoder is updated to integrate with the trainer. The tokenizer is implemented as a drop-in replacement for the SentencePiece tokenizer. The new tokenizer more than doubles the number of wordforms represented in the vocabulary. The enhanced vocabulary significantly improves NLP model convergence and the quality of word and sentence embeddings. Our experimental results show top performance on two GLUE tasks using BERT-base, outperforming models more than 50X its size.
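
The abstract does not spell out the trainer's algorithm, but the central idea, using stemming so that related wordforms share subword pieces, can be sketched. The following is a minimal, hypothetical illustration in Python: the function `build_stem_vocab`, the WordPiece-style `##` continuation marker, and the frequency-based suffix budget are assumptions made for demonstration, not the paper's method, and NLTK's SnowballStemmer stands in for whatever stemmer the trainer actually uses.

```python
from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

def build_stem_vocab(corpus_tokens, vocab_size):
    """Illustrative heuristic: bucket wordforms by stem, emit each stem as a
    shared prefix piece, then spend the remaining budget on frequent suffixes.
    This is a sketch of the idea, not the paper's actual trainer."""
    stemmer = SnowballStemmer("english")
    stem_groups = defaultdict(Counter)

    # Bucket every observed wordform under its stem, with frequencies.
    for token in corpus_tokens:
        stem_groups[stemmer.stem(token)][token] += 1

    vocab = set()
    suffix_counts = Counter()
    for stem, forms in stem_groups.items():
        vocab.add(stem)
        for form, count in forms.items():
            # Residual suffix, marked as a continuation piece
            # (e.g. "running" -> "run" + "##ning").
            if form != stem and form.startswith(stem):
                suffix_counts["##" + form[len(stem):]] += count

    # Fill the remaining vocabulary budget with the most frequent suffixes.
    budget = max(0, vocab_size - len(vocab))
    for suffix, _ in suffix_counts.most_common(budget):
        vocab.add(suffix)
    return vocab

tokens = "running runs runner ran walking walked walks".split()
print(sorted(build_stem_vocab(tokens, vocab_size=12)))
```

Since the tokenizer is described as a drop-in replacement for SentencePiece, stem groups like these would presumably feed a SentencePiece-style trainer rather than the standalone selection heuristic shown here.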

Related Research

03/15/2022 - Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
State-of-the-art NLP systems represent inputs with word embeddings, but ...

11/17/2020 - MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining
Despite the development of pre-trained language models (PLMs) significan...

04/25/2020 - All Word Embeddings from One Embedding
In neural network-based models for natural language processing (NLP), th...

09/22/2019 - Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR
Acoustic-to-word (A2W) end-to-end automatic speech recognition (ASR) sys...

10/25/2018 - Bayesian Compression for Natural Language Processing
In natural language processing, a lot of the tasks are successfully solv...

06/27/2016 - Network-Efficient Distributed Word2vec Training System for Large Vocabularies
Word2vec is a popular family of algorithms for unsupervised training of ...

05/05/2023 - Now It Sounds Like You: Learning Personalized Vocabulary On Device
In recent years, Federated Learning (FL) has shown significant advanceme...
