Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing

04/21/2023
by Tatsuya Hiraoka, et al.

This paper proposes a method that optimizes tokenization to improve the performance of already trained downstream models. Our method generates tokenization results that attain lower loss values of a given downstream model on the training data for vocabulary restriction, and trains a tokenizer to reproduce those tokenization results. Because the tokenizer is optimized as post-processing, our method can be applied to a variety of tokenization methods, whereas existing work cannot because it learns the tokenizer and the downstream model simultaneously. As an example, this paper proposes a BiLSTM-based tokenizer with vocabulary restriction, which can capture wider contextual information during tokenization than the non-neural tokenization methods used in existing work. Experimental results on text classification tasks in Japanese, Chinese, and English show that the proposed method improves performance compared to existing methods for tokenization optimization.
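The selection step described in the abstract can be illustrated with a small Python sketch. It is only one possible reading, not the authors' implementation: the exhaustive enumeration, the function names, the toy vocabulary, and `toy_loss` (a stand-in for the loss of a trained downstream model) are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Set


def segmentations(text: str, vocab: Set[str]) -> List[List[str]]:
    """Enumerate every way to split `text` into tokens drawn from `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for tail in segmentations(text[end:], vocab):
                results.append([head] + tail)
    return results


def lowest_loss_tokenization(
    text: str, vocab: Set[str], loss_fn: Callable[[List[str]], float]
) -> Optional[List[str]]:
    """Return the candidate tokenization with the lowest loss, or None."""
    candidates = segmentations(text, vocab)
    return min(candidates, key=loss_fn) if candidates else None


def restrict_vocabulary(
    texts: List[str], vocab: Set[str], loss_fn: Callable[[List[str]], float]
) -> Set[str]:
    """Keep only the tokens used by the lowest-loss tokenizations."""
    kept: Set[str] = set()
    for text in texts:
        best = lowest_loss_tokenization(text, vocab, loss_fn)
        if best is not None:
            kept.update(best)
    return kept


def toy_loss(tokens: List[str]) -> float:
    # Hypothetical stand-in: a real setup would query the trained
    # downstream model for its loss on this tokenization.
    return float(len(tokens))


# Hypothetical toy vocabulary and training texts.
VOCAB = {"u", "n", "d", "o", "un", "do", "re", "undo", "redo"}
TRAIN_TEXTS = ["undo", "redo"]

restricted = restrict_vocabulary(TRAIN_TEXTS, VOCAB, toy_loss)
```

In this reading, the selected tokenizations would then serve as training targets for a tokenizer such as the BiLSTM-based one mentioned above; that training step is not shown here.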

research
05/26/2021

Joint Optimization of Tokenization and Downstream Model

Since traditional tokenizers are isolated from a downstream task and mod...
research
10/28/2020

A Chinese Text Classification Method With Low Hardware Requirement Based on Improved Model Concatenation

In order to improve the accuracy performance of Chinese text classificat...
research
03/28/2019

Resilient Combination of Complementary CNN and RNN Features for Text Classification through Attention and Ensembling

State-of-the-art methods for text classification include several distinc...
research
02/25/2020

Label-guided Learning for Text Classification

Text classification is one of the most important and fundamental tasks i...
research
08/18/2023

Differentiable Retrieval Augmentation via Generative Language Modeling for E-commerce Query Intent Classification

Retrieval augmentation, which enhances downstream models by a knowledge ...
research
04/29/2021

Recognition and Processing of NOTAM

In this paper we show how to process the NOTAM (Notice to Airmen) data o...
research
04/09/2021

BERT-based Chinese Text Classification for Emergency Domain with a Novel Loss Function

This paper proposes an automatic Chinese text categorization method for ...
