Text Classification through Glyph-aware Disentangled Character Embedding and Semantic Sub-character Augmentation

11/09/2020
by   Takumi Aoki, et al.
0

We propose a new character-based text classification framework for non-alphabetic languages, such as Chinese and Japanese. Our framework consists of a variational character encoder (VCE) and character-level text classifier. The VCE is composed of a β-variational auto-encoder (β-VAE) that learns the proposed glyph-aware disentangled character embedding (GDCE). Since our GDCE provides zero-mean unit-variance character embeddings that are dimensionally independent, it is applicable for our interpretable data augmentation, namely, semantic sub-character augmentation (SSA). In this paper, we evaluated our framework using Japanese text classification tasks at the document- and sentence-level. We confirmed that our GDCE and SSA not only provided embedding interpretability but also improved the classification performance. Our proposal achieved a competitive result to the state-of-the-art model while also providing model interpretability. Our code is available on https://github.com/IyatomiLab/GDCE-SSA

READ FULL TEXT
research
11/14/2016

Character-level Convolutional Network for Text Classification Applied to Chinese Corpus

This article provides an interesting exploration of character-level conv...
research
10/08/2018

End-to-End Text Classification via Image-based Embedding using Character-level Networks

For analysing and/or understanding languages having no word boundaries b...
research
11/03/2020

CharBERT: Character-aware Pre-trained Language Model

Most pre-trained language models (PLMs) construct word representations a...
research
08/10/2017

Radical-level Ideograph Encoder for RNN-based Sentiment Analysis of Chinese and Japanese

The character vocabulary can be very large in non-alphabetic languages s...
research
06/20/2020

AraDIC: Arabic Document Classification using Image-Based Character Embeddings and Class-Balanced Loss

Classical and some deep learning techniques for Arabic text classificati...
research
02/18/2023

RetVec: Resilient and Efficient Text Vectorizer

This paper describes RetVec, a resilient multilingual embedding scheme d...
research
06/15/2021

SSMix: Saliency-Based Span Mixup for Text Classification

Data augmentation with mixup has shown to be effective on various comput...

Please sign up or login with your details

Forgot password? Click here to reset